The US College Scorecard dataset provides everyday folks, like myself, access to a rich source of US college information. Yet, it’s quite a challenge making sense out of such a sprawling dataset. And even moreso to make sense out of it in a way that captures the specific context of a student’s interest in determining which colleges are good candidates to attend or not.
Before jumping into the analysis, I’d like to acknowledge those folks who generously shared their scripts on Kaggle.com from which I learned – and outright copied – code snips to get this done. In particular, the code provided by Ben Hamner on using package ‘RSQLite’ and by Tad Dallas on using package ‘leaflet’ for map plotting. Thank you!!!
I set out to build a decision analysis tool that would be useful to an individual, like me, for judging the suitability of colleges specifically attuned to my criteria. The analysis is based upon a model that combines the US College Scorecard data and my personal insights – i.e., domain expertise – into something useful for the following:
This is still but a preliminary proof-of-concept level of analysis to illustrate the possibilities with this framework. Ideally, folks will see this, revise, complete and extend it.
My modeling plan looks like this:
I’ve done a ton of pre-processing, plotting and data transformations, even unsupervised learning, to explore the data. I’m going to skip the presentation of those exploration pieces here but instead just show the data winnowing and wrangling I did in preparation for modeling.
Let’s start by loading the packages, downloading & installing them if necessary:
library('RSQLite') # for SQLite manipulation of database
## Warning: package 'RSQLite' was built under R version 3.4.1
library('magrittr') # for piping with infix %>%, %<>%, %T>% (for cool, sophistication effect)
library('tidyr') # for reshaping data (e.g., 'gather')
## Warning: package 'tidyr' was built under R version 3.4.1
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library('dplyr') # for data wrangling (e.g., 'filter', 'select', 'summarize', 'inner_join')
## Warning: package 'dplyr' was built under R version 3.4.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library('ggplot2') # for best-in-class visualization
## Warning: package 'ggplot2' was built under R version 3.4.1
library('gridExtra') # for grid layout of ggplots
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# library('poLCA') # for estimation of latent classes underlying categorical variables
library('bnlearn') # for learning BBN from datasets, and also provides function "discretize"
## Warning: package 'bnlearn' was built under R version 3.4.1
##
## Attaching package: 'bnlearn'
## The following object is masked from 'package:stats':
##
## sigma
# library('gRain') # for Bayesian belief network (BBN) inference engine
library('leaflet') # for mapping of the top schools for each student
library('htmltools') # for assistance with leaflet
library('RColorBrewer') # for color palette management
After orienting myself a little with the data documentation,…, I started by borrowing code from Ben Hamner’s “Exploring the US College Scorecard Data” script:
db <- dbConnect(dbDriver("SQLite"), "../input/database.sqlite")
# This stops SQLite from writing temp files to disk, which use all the available space
dbGetQuery(db, "PRAGMA temp_store=2;")
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## data frame with 0 columns and 0 rows
tables <- dbGetQuery(db, "SELECT Name FROM sqlite_master WHERE type='table'")
colnames(tables) <- c("Name")
tables %<>%
rowwise() %>%
mutate(RowCount=dbGetQuery(db, paste0("SELECT COUNT(Id) RowCount FROM ", Name))$RowCount[1])
## Warning: package 'bindrcpp' was built under R version 3.4.1
show(tables)
## Source: local data frame [1 x 2]
## Groups: <by row>
##
## # A tibble: 1 x 2
## Name RowCount
## <chr> <int>
## 1 Scorecard 124699
allFields <- dbListFields(db,tables$Name[1])
show(length(allFields))
## [1] 1731
I also downloaded the data dictionary and relied upon it heavily to understand the database contents. Here I read it in and “revise” it by getting rid of the strange characters that were meant to be apostrophe’s and, more importantly, filled in the blank cells of its tabular format so it shows up like a standard data.frame useful for automated processing:
if(!exists("dataDict")) {
if(!file.exists("CollegeScorecardDataDictionary-09-12-2015_REVISED.csv")) {
# JUST ASSUMES THAT FILE "CollegeScorecardDataDictionary-09-12-2015.csv"
# DOES EXIST IN THE WORKING DIRECTORY:
dataDict <- read.csv(file="../input/CollegeScorecardDataDictionary-09-12-2015.csv",
stringsAsFactors = FALSE)
show(head(dataDict))
str(dataDict)
show(setdiff(allFields,dataDict$VARIABLE.NAME))
show(setdiff(dataDict$VARIABLE.NAME,allFields))
dataDict <- dataDict %>% mutate(LABEL=gsub("^(.*)ââ¬â¢(.*)$","\\1'\\2",LABEL))
# Fill-down
fillDown <- function(df=dataDict,fillCols=c(1:5,8)) {
nr <- nrow(dataDict)
irow <- 1
while(irow < nr){
while(df$NAME.OF.DATA.ELEMENT[irow] == ""){
df[irow,fillCols] <- df[irow-1,fillCols]
irow <- irow + 1
}
irow <- irow + 1
}
return(df)
}
dataDict <- dataDict %>% fillDown(c(1:5,8))
# write.csv(dataDict,file="CollegeScorecardDataDictionary-09-08-2015_REVISED.csv",
# row.names = FALSE,na="")
} else {
dataDict <- read.csv(file="../input/CollegeScorecardDataDictionary-09-12-2015_REVISED.csv",
stringsAsFactors = FALSE)
}
}
## NAME.OF.DATA.ELEMENT dev.category developer.friendly.name
## 1 Unit ID for institution root id
## 2 8-digit OPE ID for institution root ope8_id
## 3 6-digit OPE ID for institution root ope6_id
## 4 Institution name school name
## 5 City school city
## 6 State postcode school state
## API.data.type VARIABLE.NAME VALUE LABEL SOURCE NOTES
## 1 integer UNITID NA IPEDS
## 2 integer OPEID NA IPEDS
## 3 integer opeid6 NA IPEDS
## 4 autocomplete INSTNM NA IPEDS
## 5 autocomplete CITY NA IPEDS
## 6 string STABBR NA IPEDS
## 'data.frame': 1953 obs. of 9 variables:
## $ NAME.OF.DATA.ELEMENT : chr "Unit ID for institution" "8-digit OPE ID for institution" "6-digit OPE ID for institution" "Institution name" ...
## $ dev.category : chr "root" "root" "root" "school" ...
## $ developer.friendly.name: chr "id" "ope8_id" "ope6_id" "name" ...
## $ API.data.type : chr "integer" "integer" "integer" "autocomplete" ...
## $ VARIABLE.NAME : chr "UNITID" "OPEID" "opeid6" "INSTNM" ...
## $ VALUE : int NA NA NA NA NA NA NA NA NA NA ...
## $ LABEL : chr "" "" "" "" ...
## $ SOURCE : chr "IPEDS" "IPEDS" "IPEDS" "IPEDS" ...
## $ NOTES : chr "" "" "" "" ...
## [1] "Id" "C200_4" "C200_L4" "D200_4"
## [5] "D200_L4" "C200_4_POOLED" "C200_L4_POOLED" "poolyrs200"
## [9] "D200_4_POOLED" "D200_L4_POOLED" "Year"
## [1] ""
print(head(dataDict))
## NAME.OF.DATA.ELEMENT dev.category developer.friendly.name
## 1 Unit ID for institution root id
## 2 8-digit OPE ID for institution root ope8_id
## 3 6-digit OPE ID for institution root ope6_id
## 4 Institution name school name
## 5 City school city
## 6 State postcode school state
## API.data.type VARIABLE.NAME VALUE LABEL SOURCE NOTES
## 1 integer UNITID NA IPEDS
## 2 integer OPEID NA IPEDS
## 3 integer opeid6 NA IPEDS
## 4 autocomplete INSTNM NA IPEDS
## 5 autocomplete CITY NA IPEDS
## 6 string STABBR NA IPEDS
print(dataDict %>% tbl_df)
## # A tibble: 1,953 x 9
## NAME.OF.DATA.ELEMENT dev.category
## <chr> <chr>
## 1 Unit ID for institution root
## 2 8-digit OPE ID for institution root
## 3 6-digit OPE ID for institution root
## 4 Institution name school
## 5 City school
## 6 State postcode school
## 7 ZIP code school
## 8 Accreditor for institution school
## 9 URL for institution's homepage school
## 10 URL for institution's net price calculator school
## # ... with 1,943 more rows, and 7 more variables:
## # developer.friendly.name <chr>, API.data.type <chr>,
## # VARIABLE.NAME <chr>, VALUE <int>, LABEL <chr>, SOURCE <chr>,
## # NOTES <chr>
A key concept in the probabilistic Bayesian modeling framework I’ve chosen is to represent each college by a set of variables quantifying the types of students who populate their campus. This requires turning as many of the provided variables as possible into conditional probability distributions such as P(student-income > x | college = school-of-interest). From these conditional probabilities,
I then computed Bayes Factors: log10(P(student-income > x | college = school-of-interest)/P(student-income > x | college != school-of-interest)) = Posterior-Odds(college = school-of-interest | student-income>x)/Prior-Odds(college = school-of-interest).
These Bayes factors then serve as the covariates in the linear utility function of the discrete choice model: u(school-of-interest | student) = beta(student) * Bayes-factors(school-of-interest). (See Nobel Laureate for Economics Daniel McFadden’s contributions in discrete choice modeling.)
Then the student can just rank-order the colleges by their utilities u(school-of-interest | student).
Now, there are many alternatives to this framework. I chose this one because of its flexibility and my experience using such models. And even though I don’t do a rigorous data-estimated hierarchical model using tools such as Stan through package rstan, the promise of the framework is hinted at in the flexibility of the subsequent analyses.
Let’s get on with the work!
Grab the requisite datasets: Used years 2005 (for the Treasury data on students’ originating zip codes) and 2013.
# This groups the disciplines (degrees) into reasonably similar sets.
#discgrp <- read.csv("discgrp.csv", stringsAsFactors=FALSE)
discgrp <- data_frame(LABEL=c("Agriculture, Agriculture Operations, and Related Sciences","Natural Resources and Conservation",
"Architecture and Related Services","Area, Ethnic, Cultural, Gender, and Group Studies",
"Communication, Journalism, and Related Programs","Communications Technologies/Technicians and Support Services",
"Computer and Information Sciences and Support Services","Personal and Culinary Services",
"Education","Engineering","Engineering Technologies and Engineering-Related Fields",
"Foreign Languages, Literatures, and Linguistics","Family and Consumer Sciences/Human Sciences",
"Legal Professions and Studies","English Language and Literature/Letters",
"Liberal Arts and Sciences, General Studies and Humanities","Library Science","Biological and Biomedical Sciences",
"Mathematics and Statistics","Military Technologies and Applied Sciences","Multi/Interdisciplinary Studies",
"Parks, Recreation, Leisure, and Fitness Studies","Philosophy and Religious Studies","Theology and Religious Vocations",
"Physical Sciences","Science Technologies/Technicians","Psychology",
"Homeland Security, Law Enforcement, Firefighting and Related Protective Services",
"Public Administration and Social Service Professions","Social Sciences","Construction Trades",
"Mechanic and Repair Technologies/Technicians","Precision Production","Transportation and Materials Moving",
"Visual and Performing Arts","Health Professions and Related Programs",
"Business, Management, Marketing, and Related Support Services","History"),
discrp=c(2,4,2,5,6,4,1,4,5,1,1,5,5,7,5,5,5,2,1,4,5,4,5,5,2,3,3,4,3,3,4,4,4,4,6,2,7,5),
VARIABLE.NAME=c("PCIP01","PCIP03","PCIP04","PCIP05","PCIP09","PCIP10","PCIP11","PCIP12","PCIP13",
"PCIP14","PCIP15","PCIP16","PCIP19","PCIP22","PCIP23","PCIP24","PCIP25","PCIP26",
"PCIP27","PCIP29","PCIP30","PCIP31","PCIP38","PCIP39","PCIP40","PCIP41","PCIP42",
"PCIP43","PCIP44","PCIP45","PCIP46","PCIP47","PCIP48","PCIP49","PCIP50","PCIP51",
"PCIP52","PCIP54"))
Only deal with colleges having these traits:
Now the challenge is to match the available database columns to the criteria I wish to calculate.
fieldNames <-
c(
'UGDS','UGDS_WHITE','UGDS_BLACK','UGDS_HISP','UGDS_ASIAN','UGDS_AIAN','UGDS_NHPI',
'UGDS_2MOR','UGDS_NRA','UGDS_UNKN',"UG25abv","INC_PCT_LO","DEP_STAT_PCT_IND","DEP_INC_PCT_LO","IND_INC_PCT_LO",
"DEP_INC_PCT_M1","DEP_INC_PCT_M2","DEP_INC_PCT_H1","IND_INC_PCT_M1","IND_INC_PCT_M2","IND_INC_PCT_H1","IND_INC_PCT_H2",
"DEP_INC_PCT_H2","PAR_ED_PCT_1STGEN","pct_white","pct_black","pct_asian","pct_hispanic","pct_ba","pct_grad_prof",
"pct_born_us","median_hh_inc","poverty_rate","unemp_rate","ln_median_hh_inc","pell_ever","female","fsend_1","fsend_2",
"fsend_3","fsend_4","fsend_5","INC_PCT_LO","INC_PCT_M1","INC_PCT_M2","INC_PCT_H1","INC_PCT_H2","LATITUDE",
"LONGITUDE","ZIP","STABBR","region","LOCALE","CCSIZSET","DISTANCEONLY","RELAFFIL","CURROPER","NPT4_PUB",
"NPT4_PRIV","NPT41_PUB","NPT42_PUB","NPT43_PUB","NPT44_PUB","NPT45_PUB","NPT41_PRIV","NPT42_PRIV","NPT43_PRIV",
"NPT44_PRIV","NPT45_PRIV","TUITIONFEE_IN","TUITIONFEE_OUT","PCTPELL","ADM_RATE","RET_FT4","SATVR25","SATVR75",
"SATMT25","SATMT75","SATWR25","SATWR75","SATVRMID","SATMTMID","SATWRMID","ACTCM25","ACTCM75","ACTEN25","ACTEN75",
"ACTMT25","ACTMT75","ACTWR25","ACTWR75","ACTCMMID","ACTENMID","ACTMTMID","ACTWRMID","SAT_AVG","SAT_AVG_ALL",
"PCIP01","PCIP03","PCIP04","PCIP05","PCIP09","PCIP10","PCIP11","PCIP12",
"PCIP13","PCIP14","PCIP15","PCIP16","PCIP19","PCIP22","PCIP23","PCIP24","PCIP25","PCIP26","PCIP27","PCIP29","PCIP30",
"PCIP31","PCIP38","PCIP39","PCIP40","PCIP41","PCIP42","PCIP43","PCIP44","PCIP45","PCIP46","PCIP47","PCIP48","PCIP49",
"PCIP50","PCIP51","PCIP52","PCIP54","C150_4_WHITE","C150_4_BLACK","C150_4_HISP","C150_4_ASIAN","ENRL_ORIG_YR2_RT",
"LO_INC_ENRL_ORIG_YR2_RT","MD_INC_ENRL_ORIG_YR2_RT","HI_INC_ENRL_ORIG_YR2_RT","DEP_ENRL_ORIG_YR2_RT","IND_ENRL_ORIG_YR2_RT",
"FEMALE_ENRL_ORIG_YR2_RT","MALE_ENRL_ORIG_YR2_RT","PELL_ENRL_ORIG_YR2_RT","NOPELL_ENRL_ORIG_YR2_RT","LOAN_ENRL_ORIG_YR2_RT",
"NOLOAN_ENRL_ORIG_YR2_RT","FIRSTGEN_ENRL_ORIG_YR2_RT","NOT1STGEN_ENRL_ORIG_YR2_RT","ENRL_ORIG_YR3_RT","LO_INC_ENRL_ORIG_YR3_RT",
"MD_INC_ENRL_ORIG_YR3_RT","HI_INC_ENRL_ORIG_YR3_RT","DEP_ENRL_ORIG_YR3_RT","IND_ENRL_ORIG_YR3_RT","FEMALE_ENRL_ORIG_YR3_RT",
"MALE_ENRL_ORIG_YR3_RT","PELL_ENRL_ORIG_YR3_RT","NOPELL_ENRL_ORIG_YR3_RT","LOAN_ENRL_ORIG_YR3_RT","NOLOAN_ENRL_ORIG_YR3_RT",
"FIRSTGEN_ENRL_ORIG_YR3_RT","NOT1STGEN_ENRL_ORIG_YR3_RT","ENRL_ORIG_YR4_RT","LO_INC_ENRL_ORIG_YR4_RT","MD_INC_ENRL_ORIG_YR4_RT",
"HI_INC_ENRL_ORIG_YR4_RT","DEP_ENRL_ORIG_YR4_RT","IND_ENRL_ORIG_YR4_RT","FEMALE_ENRL_ORIG_YR4_RT","MALE_ENRL_ORIG_YR4_RT",
"PELL_ENRL_ORIG_YR4_RT","NOPELL_ENRL_ORIG_YR4_RT","LOAN_ENRL_ORIG_YR4_RT","NOLOAN_ENRL_ORIG_YR4_RT","FIRSTGEN_ENRL_ORIG_YR4_RT",
"NOT1STGEN_ENRL_ORIG_YR4_RT","ENRL_ORIG_YR6_RT","LO_INC_ENRL_ORIG_YR6_RT","MD_INC_ENRL_ORIG_YR6_RT","HI_INC_ENRL_ORIG_YR6_RT",
"DEP_ENRL_ORIG_YR6_RT","IND_ENRL_ORIG_YR6_RT","FEMALE_ENRL_ORIG_YR6_RT","MALE_ENRL_ORIG_YR6_RT","PELL_ENRL_ORIG_YR6_RT",
"NOPELL_ENRL_ORIG_YR6_RT","LOAN_ENRL_ORIG_YR6_RT","NOLOAN_ENRL_ORIG_YR6_RT","FIRSTGEN_ENRL_ORIG_YR6_RT","NOT1STGEN_ENRL_ORIG_YR6_RT",
"ENRL_ORIG_YR8_RT","LO_INC_ENRL_ORIG_YR8_RT","MD_INC_ENRL_ORIG_YR8_RT","HI_INC_ENRL_ORIG_YR8_RT","DEP_ENRL_ORIG_YR8_RT",
"IND_ENRL_ORIG_YR8_RT","FEMALE_ENRL_ORIG_YR8_RT","MALE_ENRL_ORIG_YR8_RT","PELL_ENRL_ORIG_YR8_RT","NOPELL_ENRL_ORIG_YR8_RT",
"LOAN_ENRL_ORIG_YR8_RT","NOLOAN_ENRL_ORIG_YR8_RT","FIRSTGEN_ENRL_ORIG_YR8_RT","NOT1STGEN_ENRL_ORIG_YR8_RT","C150_4_POOLED_SUPP",
"PCTFLOAN","DEBT_MDN","GRAD_DEBT_MDN","WDRAW_DEBT_MDN","LO_INC_DEBT_MDN","MD_INC_DEBT_MDN","HI_INC_DEBT_MDN","DEP_DEBT_MDN",
"IND_DEBT_MDN","PELL_DEBT_MDN","NOPELL_DEBT_MDN","CUML_DEBT_N","CUML_DEBT_P90","CUML_DEBT_P75","CUML_DEBT_P25",
"CUML_DEBT_P10","loan_ever","DEBT_MDN_SUPP","GRAD_DEBT_MDN_SUPP","GRAD_DEBT_MDN10YR_SUPP","CDR2","CDR3","COMPL_RPY_1YR_RT",
"NONCOM_RPY_1YR_RT","LO_INC_RPY_1YR_RT","MD_INC_RPY_1YR_RT","HI_INC_RPY_1YR_RT","DEP_RPY_1YR_RT","IND_RPY_1YR_RT",
"PELL_RPY_1YR_RT","NOPELL_RPY_1YR_RT","FEMALE_RPY_1YR_RT","MALE_RPY_1YR_RT","FIRSTGEN_RPY_1YR_RT","NOTFIRSTGEN_RPY_1YR_RT",
"RPY_3YR_RT","COMPL_RPY_3YR_RT","NONCOM_RPY_3YR_RT","LO_INC_RPY_3YR_RT","MD_INC_RPY_3YR_RT","HI_INC_RPY_3YR_RT","DEP_RPY_3YR_RT",
"IND_RPY_3YR_RT","PELL_RPY_3YR_RT","NOPELL_RPY_3YR_RT","FEMALE_RPY_3YR_RT","MALE_RPY_3YR_RT","FIRSTGEN_RPY_3YR_RT",
"NOTFIRSTGEN_RPY_3YR_RT","RPY_5YR_RT","COMPL_RPY_5YR_RT","NONCOM_RPY_5YR_RT","LO_INC_RPY_5YR_RT","MD_INC_RPY_5YR_RT",
"HI_INC_RPY_5YR_RT","DEP_RPY_5YR_RT","IND_RPY_5YR_RT","PELL_RPY_5YR_RT","NOPELL_RPY_5YR_RT","FEMALE_RPY_5YR_RT",
"MALE_RPY_5YR_RT","FIRSTGEN_RPY_5YR_RT","NOTFIRSTGEN_RPY_5YR_RT","RPY_7YR_RT","COMPL_RPY_7YR_RT","NONCOM_RPY_7YR_RT",
"LO_INC_RPY_7YR_RT","MD_INC_RPY_7YR_RT","HI_INC_RPY_7YR_RT","DEP_RPY_7YR_RT","IND_RPY_7YR_RT","PELL_RPY_7YR_RT",
"NOPELL_RPY_7YR_RT","FEMALE_RPY_7YR_RT","MALE_RPY_7YR_RT","FIRSTGEN_RPY_7YR_RT","NOTFIRSTGEN_RPY_7YR_RT","count_ed",
"count_nwne_p6","count_wne_p6","mn_earn_wne_p6","md_earn_wne_p6","pct10_earn_wne_p6","pct25_earn_wne_p6",
"pct75_earn_wne_p6","pct90_earn_wne_p6","sd_earn_wne_p6","gt_25k_p6","mn_earn_wne_inc1_p6","mn_earn_wne_inc2_p6",
"mn_earn_wne_inc3_p6"
)
# This dropped name set is part of the base query string...
fieldNames <- setdiff(fieldNames,c('CITY','LONGITUDE','LATITUDE','ZIP','region','LOCALE','CURROPER'))
fieldNames <- grep('PCIP',fieldNames,value = TRUE,invert = TRUE) # drop the discipline fields; they're grabbed separately.
# Upper case variables are from 2013 and lower case variables are from the Treasury dataset of 2005, except 'region'.
# fromS10 <- grep('ENRL_ORIG',fieldNames)
fromS11 <- which(fieldNames %in% grep('^[A-Z]|region',fieldNames))
fromS05 <- setdiff(seq_along(fieldNames),fromS11)
dfNames <- ifelse(grepl('^[A-Z]|region',fieldNames),fieldNames,paste(fieldNames,"2005",sep="_")) # put a '_2005' suffix on the Treasury variables.
makeQuery <- function(year,fieldNames,dfNames) {
paste("SELECT UNITID unitID,
INSTNM College,
CONTROL CollegeType,
PREDDEG degree,
CURROPER currop,
DISTANCEONLY distance,
RELAFFIL relaffil,
CITY city,
LONGITUDE lon,
LATITUDE lat,
ZIP zip,
st_fips state,
region region,
LOCALE locale,
CCBASIC ccbasic,
CCUGPROF ccugprof,
Year Year,",
paste(fieldNames,' ',dfNames,sep="",collapse = ","),
"FROM Scorecard
WHERE Year=",year)
}
queryString2005 <- makeQuery(2005,fieldNames,dfNames)
queryString2013 <- makeQuery(2013,fieldNames,dfNames)
discNames <- gsub('^([A-Z][-a-z]+)[, and/]*([A-Z][a-z]+).*','\\1\\2',discgrp$LABEL)
discgrp %<>% tbl_df %>% mutate(discName = discNames)
queryStringDscplns <- makeQuery(2013,discgrp$VARIABLE.NAME,discNames)
# Retrieve the data for 2005 because that is the latest with Treasury data on student families.
student2005 <- dbGetQuery(db,queryString2005)
## Warning in rsqlite_fetch(res@ptr, n = n): Column `zip`: mixed type, first
## seen values of type integer, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_LO`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_STAT_PCT_IND`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_LO`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_LO`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_M1`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_M2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_H1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_M1`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_M2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_H1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_H2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_H2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PAR_ED_PCT_1STGEN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_white_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_black_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_asian_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_hispanic_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_ba_2005`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_grad_prof_2005`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_born_us_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `median_hh_inc_2005`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `poverty_rate_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `unemp_rate_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ln_median_hh_inc_2005`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pell_ever_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `female_2005`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_M1`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_M2`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_H1`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_H2`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ENRL_ORIG_YR2_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LOAN_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOLOAN_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `FIRSTGEN_ENRL_ORIG_YR2_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `NOT1STGEN_ENRL_ORIG_YR2_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ENRL_ORIG_YR3_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LOAN_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOLOAN_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `FIRSTGEN_ENRL_ORIG_YR3_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `NOT1STGEN_ENRL_ORIG_YR3_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ENRL_ORIG_YR4_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LOAN_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOLOAN_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `FIRSTGEN_ENRL_ORIG_YR4_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `NOT1STGEN_ENRL_ORIG_YR4_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ENRL_ORIG_YR6_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LOAN_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOLOAN_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `FIRSTGEN_ENRL_ORIG_YR6_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `NOT1STGEN_ENRL_ORIG_YR6_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ENRL_ORIG_YR8_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LOAN_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOLOAN_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `FIRSTGEN_ENRL_ORIG_YR8_RT`: mixed type, first seen values of type string,
## coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `NOT1STGEN_ENRL_ORIG_YR8_RT`: mixed type, first seen values of type string,
## coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEBT_MDN`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `WDRAW_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_N`: mixed type,
## first seen values of type integer, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P90`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P75`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P25`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P10`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `loan_ever_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEBT_MDN_SUPP`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN_SUPP`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN10YR_SUPP`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `count_nwne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `count_wne_p6_2005`: mixed
## type, first seen values of type integer, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `mn_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `md_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct10_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct25_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct75_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct90_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `sd_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `gt_25k_p6_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `mn_earn_wne_inc1_p6_2005`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `mn_earn_wne_inc2_p6_2005`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `mn_earn_wne_inc3_p6_2005`: mixed type, first seen values of type real,
## coercing other values of type string
# Retrieve the data for 2013 because that's the latest data.
student2013 <- dbGetQuery(db,queryString2013)
## Warning in rsqlite_fetch(res@ptr, n = n): Column `zip`: mixed type, first
## seen values of type integer, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_LO`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_STAT_PCT_IND`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_LO`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_LO`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_M1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_M2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_H1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_M1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_M2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_H1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_H2`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_H2`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PAR_ED_PCT_1STGEN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_M1`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_M2`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_H1`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_H2`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `C150_4_POOLED_SUPP`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEBT_MDN`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `WDRAW_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_N`: mixed type,
## first seen values of type integer, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P90`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P75`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P25`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P10`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEBT_MDN_SUPP`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN_SUPP`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN10YR_SUPP`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `RPY_3YR_RT`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `COMPL_RPY_3YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NONCOM_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_RPY_3YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_RPY_3YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_RPY_3YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FIRSTGEN_RPY_3YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOTFIRSTGEN_RPY_3YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `RPY_5YR_RT`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `COMPL_RPY_5YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NONCOM_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_RPY_5YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_RPY_5YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_RPY_5YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FIRSTGEN_RPY_5YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOTFIRSTGEN_RPY_5YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `RPY_7YR_RT`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `COMPL_RPY_7YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NONCOM_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FIRSTGEN_RPY_7YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOTFIRSTGEN_RPY_7YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
disciplines2013 <- dbGetQuery(db,queryStringDscplns)
## Warning in rsqlite_fetch(res@ptr, n = n): Column `zip`: mixed type, first
## seen values of type integer, coercing other values of type string
# Disconnect the database.
dbDisconnect(db)
# Join the years together.
student2013 <- student2013[!sapply(student2013,function(x) all(is.na(x)))] %>% tbl_df
student2005 <- student2005[!sapply(student2005,function(x) all(is.na(x)))] %>% tbl_df
# student <- student2013 %>% tbl_df %>%
# dplyr::select(-contains('_2005')) %>%
# inner_join(student2005 %>% dplyr::select(unitID,contains('_2005')), by = "unitID")
student <- student2013 %>%
inner_join(student2005 %>% dplyr::select(unitID,contains('_2005')), by = "unitID")
# Add the dominant disciplines
discplns <- disciplines2013 %>% tbl_df %>%
filter(unitID %in% student$unitID) %>%
mutate(irow = seq_len(nrow(.))) %>%
gather(key=Discipline,value=Proportion,one_of(discNames)) %>%
group_by(irow) %>%
arrange(desc(Proportion)) %>%
summarize(
unitID = first(unitID),
College = first(College),
domDisc1 = first(Discipline),
pctDisc1 = 100*first(Proportion),
domDisc2 = nth(Discipline,n=2),
pctDisc2 = 100*nth(Proportion,n=2)
) %>%
mutate(
isSpecialty = pctDisc1 > 70,
isTheological = grepl('Theolo|Relig',domDisc1) | grepl('Theolo|Relig',domDisc2)
)
# CALLED THE MAIN DATA TABLE `student` BECAUSE ORIGINALLY ONLY CONTAINED
# `student` CATEGORY OF DATA FIELDS. NOW THIS LEGACY NAME AWKWARDLY CONFUSES
# THIS DATA TABLE OF UNIVERSITIES/COLLEGES WITH THE OTHER DATA OBJECTS
# CONTAINING SPECIFIC STUDENT PROPERTIES -- SEE LATER...
# Now add the disciplines as grouped
student %<>%
inner_join(discplns %>% dplyr::select(-College),by='unitID')
usa.states.dc <- c("District of Columbia",state.name)
student0 <- student
NOTE: At this point the foundational data table is student0, which is further manipulated as student. This includes filtering out colleges we’re not interested in. See the printouts below to get a sense of what data were kept.
# Filter down to just the colleges meeting the following criteria:
student <- student0 %>%
filter(CollegeType != 'Private for-profit',
currop == 'Currently certified as operating',
distance == 'Not distance-education only',
degree == "Predominantly bachelor's-degree granting",
as.character(region) != 'U.S. Service Schools',
!grepl('^Associate',ccbasic),
state %in% usa.states.dc,
!is.na(ccbasic),
!is.na(UGDS),
!is.na(pell_ever_2005),
!isTheological,
!is.na(mn_earn_wne_inc2_p6_2005))
# Turn character columns into factors.
student <- student %>% lapply(
function(x)
if(is.character(x) | is.factor(x) | is.logical(x)) {
if('PrivacySuppressed' %in% x) {
tmp<-x;tmp[x=='PrivacySuppressed']<-NA;as.numeric(tmp)
} else {
factor(gsub("ââ¬â¢","'",x))
}
} else as.numeric(x)) %>%
as.data.frame %>% tbl_df
# add a religious affiliation indicator variable:
student %<>%
mutate(relaffilFLAG=factor(ifelse(is.na(relaffil),
FALSE,
!grepl('^[Nn]ot ',as.character(relaffil)))))
a<- student %>% dplyr::select(grep('^(SAT|ACT)',names(.))) %>% apply(1,function(x) !all(is.na(x)))
student %<>% mutate(stdzdtest=a)
show(dim(student))
## [1] 1514 198
#show(summary(student))
Drop all the Penn St. and U. of Pitt. schools except the main campus because same 2005 Treasury data entered in all rows for each, resp.
stdttmp <- student %>% filter((College %in% c('Pennsylvania State University-Main Campus',
'University of Pittsburgh-Pittsburgh Campus')) |
!(College %in% grep('Pennsylvania State University|University of Pittsburgh',
College,value=TRUE)))
show(setdiff(student$College,stdttmp$College))
## [1] "Pennsylvania State University-Penn State Erie-Behrend College"
## [2] "Pennsylvania State University-Penn State New Kensington"
## [3] "Pennsylvania State University-Penn State Shenango"
## [4] "Pennsylvania State University-Penn State Wilkes-Barre"
## [5] "Pennsylvania State University-Penn State Worthington Scranton"
## [6] "Pennsylvania State University-Penn State Lehigh Valley"
## [7] "Pennsylvania State University-Penn State Altoona"
## [8] "Pennsylvania State University-Penn State Beaver"
## [9] "Pennsylvania State University-Penn State Berks"
## [10] "Pennsylvania State University-Penn State Harrisburg"
## [11] "Pennsylvania State University-Penn State Brandywine"
## [12] "Pennsylvania State University-Penn State Greater Allegheny"
## [13] "Pennsylvania State University-Penn State Abington"
## [14] "Pennsylvania State University-Penn State Schuylkill"
## [15] "Pennsylvania State University-Penn State York"
## [16] "University of Pittsburgh-Bradford"
## [17] "University of Pittsburgh-Greensburg"
## [18] "University of Pittsburgh-Johnstown"
show(grep('Pennsylvania State University|University of Pittsburgh',stdttmp$College,value=TRUE))
## [1] "Pennsylvania State University-Main Campus"
## [2] "University of Pittsburgh-Pittsburgh Campus"
student <- stdttmp
rm(stdttmp)
student <- student[!sapply(student,function(x) length(which(is.na(x)))>1000)]
# Drop schools without earnings data:
student %>% filter(is.na(pct10_earn_wne_p6_2005) | pct10_earn_wne_p6_2005==0) %>%
dplyr::select(unitID,College,UGDS,matches('pct.+earn')) %>%
print(n=60)
## # A tibble: 58 x 7
## unitID College
## <dbl> <fctr>
## 1 102395 United States Sports Academy
## 2 106306 Arkansas Baptist College
## 3 110316 California Institute of Integral Studies
## 4 112570 Columbia College-Hollywood
## 5 116712 John F Kennedy University
## 6 116846 American Jewish University
## 7 117168 Laguna College of Art and Design
## 8 119058 Monterey Institute of International Studies
## 9 120768 Pacific Oaks College
## 10 120838 Pacific States University
## 11 122506 San Francisco Conservatory of Music
## 12 123633 South Baylo University
## 13 123952 Southern California Institute of Architecture
## 14 124292 Thomas Aquinas College
## 15 143297 Blessing Rieman College of Nursing
## 16 146533 Lakeview College of Nursing
## 17 147590 National University of Health Sciences
## 18 148575 Saint Francis Medical Center College of Nursing
## 19 148593 St John's College of Nursing
## 20 148849 Shimer College
## 21 149028 Saint Anthony College of Nursing
## 22 149639 VanderCook College of Music
## 23 153861 Maharishi University of Management
## 24 160409 Saint Joseph Seminary College
## 25 160959 College of the Atlantic
## 26 164872 Boston Architectural College
## 27 164933 The Boston Conservatory
## 28 166045 Hebrew College
## 29 166869 MGH Institute of Health Professions
## 30 177038 Cleveland University-Kansas City
## 31 179450 Saint Luke's College of Health Sciences
## 32 180878 Bryan College of Health Sciences
## 33 183275 Thomas More College of Liberal Arts
## 34 187745 Institute of American Indian and Alaska Native Culture
## 35 190576 CUNY Graduate School and University Center
## 36 192712 Manhattan School of Music
## 37 197221 Webb Institute
## 38 198677 Heritage Bible College
## 39 201061 Art Academy of Cincinnati
## 40 202073 Cleveland Institute of Music
## 41 205027 Pontifical College Josephinum
## 42 207856 Southwestern Christian University
## 43 209533 Oregon College of Art and Craft
## 44 211893 Curtis Institute of Music
## 45 214971 Pennsylvania Academy of the Fine Arts
## 46 215053 Pennsylvania College of Art and Design
## 47 220808 Memphis College of Art
## 48 221254 O'More College of Design
## 49 223214 Texas A & M University Health Science Center
## 50 231095 Sterling College
## 51 231651 Regent University
## 52 233611 Southern Virginia University
## 53 238324 Bellin College
## 54 243823 Parker University
## 55 373827 New England School of Communications
## 56 385619 Everglades University
## 57 416801 The University of Texas MD Anderson Cancer Center
## 58 435000 Louisiana State University Health Sciences Center-Shreveport
## # ... with 5 more variables: UGDS <dbl>, pct10_earn_wne_p6_2005 <dbl>,
## # pct25_earn_wne_p6_2005 <dbl>, pct75_earn_wne_p6_2005 <dbl>,
## # pct90_earn_wne_p6_2005 <dbl>
student %<>%
filter(!is.na(pct10_earn_wne_p6_2005) & pct10_earn_wne_p6_2005>0)
#show(summary(student))
Most selective schools according to admissions rate:
# Most selective schools according to admissions rate:
# student %>%
# filter(!is.na(ADM_RATE) & ADM_RATE<0.2 & UGDS>600) %>%
# dplyr::select(unitID,College,UGDS,matches('SAT|ACT'),ADM_RATE) %>%
# arrange(ADM_RATE) %>%
# print(n=40)
student %>%
filter(!is.na(ADM_RATE) & ADM_RATE<0.2 & UGDS>600) %>%
dplyr::select(unitID,College,UGDS,contains('Disc'),ADM_RATE,pell_ever_2005) %>%
arrange(ADM_RATE) %>%
print(n=40)
## # A tibble: 37 x 9
## unitID College UGDS
## <dbl> <fctr> <dbl>
## 1 243744 Stanford University 6980
## 2 166027 Harvard University 7278
## 3 130794 Yale University 5422
## 4 186131 Princeton University 5234
## 5 190150 Columbia University in the City of New York 7970
## 6 190372 Cooper Union for the Advancement of Science and Art 851
## 7 166683 Massachusetts Institute of Technology 4510
## 8 144050 University of Chicago 5697
## 9 217156 Brown University 6182
## 10 182670 Dartmouth College 4154
## 11 110404 California Institute of Technology 977
## 12 112260 Claremont McKenna College 1311
## 13 215062 University of Pennsylvania 10606
## 14 221999 Vanderbilt University 6794
## 15 178697 College of the Ozarks 1513
## 16 133872 Adventist University of Health Sciences 2064
## 17 198419 Duke University 6501
## 18 176318 Rust College 922
## 19 121345 Pomona College 1587
## 20 164465 Amherst College 1785
## 21 216287 Swarthmore College 1524
## 22 121257 Pitzer College 1081
## 23 161004 Bowdoin College 1789
## 24 147767 Northwestern University 8855
## 25 179867 Washington University in St Louis 6851
## 26 190415 Cornell University 14309
## 27 227757 Rice University 3920
## 28 131496 Georgetown University 7261
## 29 168342 Williams College 2051
## 30 230959 Middlebury College 2466
## 31 110635 University of California-Berkeley 25951
## 32 162928 Johns Hopkins University 5932
## 33 115409 Harvey Mudd College 803
## 34 234207 Washington and Lee University 1846
## 35 168148 Tufts University 5146
## 36 123961 University of Southern California 18087
## 37 138716 Albany State University 3566
## # ... with 6 more variables: domDisc1 <fctr>, pctDisc1 <dbl>,
## # domDisc2 <fctr>, pctDisc2 <dbl>, ADM_RATE <dbl>, pell_ever_2005 <dbl>
#===================================================
What follows are code chunks to generate specific contextual data tables of conditional probabilities and Bayes factors for different aspects of the college experience for a student at each college, as was described above. Note that I fudged the Bayes factors – even after defining the functions to compute them properly. Also, I did a bit of exploration with poLCA, but commented it out. (Using ggplot2, it’s easy to visually explore clustering solutions using ‘small multiples’ (i.e., facet_wrap). But not shown here….)
ethnicity <- student %>%
dplyr::select(unitID,College,contains("UGDS")) %>%
mutate(probSchool = UGDS/sum(UGDS),
totprob = rowSums(as.matrix(.[4:ncol(.)])))
ethnicity %>% print(n=20)
## # A tibble: 1,438 x 14
## unitID College UGDS UGDS_WHITE UGDS_BLACK
## <dbl> <fctr> <dbl> <dbl> <dbl>
## 1 100654 Alabama A & M University 4051 0.0279 0.9501
## 2 100663 University of Alabama at Birmingham 11200 0.5987 0.2590
## 3 100706 University of Alabama in Huntsville 5525 0.7012 0.1310
## 4 100724 Alabama State University 5354 0.0161 0.9285
## 5 100751 The University of Alabama 28692 0.7865 0.1140
## 6 100812 Athens State University 2999 0.7513 0.1064
## 7 100830 Auburn University at Montgomery 4322 0.5532 0.3031
## 8 100858 Auburn University 19761 0.8543 0.0714
## 9 100937 Birmingham Southern College 1181 0.8103 0.0940
## 10 101073 Concordia College Alabama 523 0.0115 0.9388
## 11 101189 Faulkner University 2358 0.3919 0.5254
## 12 101435 Huntingdon College 1100 0.5845 0.1918
## 13 101480 Jacksonville State University 7195 0.6655 0.2628
## 14 101541 Judson College 331 0.7915 0.1450
## 15 101587 University of West Alabama 1988 0.4291 0.4195
## 16 101675 Miles College 1663 0.0180 0.9693
## 17 101693 University of Mobile 1472 0.6461 0.2160
## 18 101709 University of Montevallo 2610 0.7368 0.1398
## 19 101879 University of North Alabama 5536 0.7150 0.1384
## 20 101912 Oakwood University 1854 0.0102 0.8495
## # ... with 1,418 more rows, and 9 more variables: UGDS_HISP <dbl>,
## # UGDS_ASIAN <dbl>, UGDS_AIAN <dbl>, UGDS_NHPI <dbl>, UGDS_2MOR <dbl>,
## # UGDS_NRA <dbl>, UGDS_UNKN <dbl>, probSchool <dbl>, totprob <dbl>
This computes P(E | not(H)) the conditional probability of evidence given the hypothesis is FALSE using inputs of cpH = P(E | H) probability of evidence given the hypothesis is TRUE and pH = P(H) the marginal (prior) probability of hypothesis being TRUE. But, in this case of almost 1500 schools, where “H” is school si, P(E|not(H)) is approximately P(E) = sum(P(E|H)*P(H);over H). So, could skip this function and just use P(E).
IMPORTANT: I use log10 of the Bayes factors, which is a bit unusual, but I know powers of 10 better than powers of 2 or e.
cpNotH <- function(cpH,pH){
pEH <- cpH * pH
pE <- sum(pEH)
return((pE - pEH)/(1 - pH))
}
BFactor <- function(cpH,cpNotH,logFunc=log10){
x <- cpH / cpNotH
return(if(is.function(logFunc)) logFunc(x) else x)
}
makeCpNotH <- function(x) mapply(cpNotH,cpH=x,pH=ethnicity$probSchool)
makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*ethnicity$probSchool,na.rm=TRUE))} # Approximate
ethnicBF <- ethnicity %>%
dplyr::select(contains("UGDS_")) %>%
mutate_each(funs(makeBF)) %>%
setNames(gsub("UGDS","BF",names(.))) %$%
bind_cols(ethnicity[1:2],.) %>%
mutate(BF_female_2005 = makeBF(student$female_2005))
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
#=====================================================
income <- student %>%
dplyr::select(unitID,College,UGDS,DEP_STAT_PCT_IND,contains("INC_PCT")) %>%
mutate(probSchool = UGDS/sum(UGDS),
totprob = rowSums(as.matrix(.[c("INC_PCT_LO","INC_PCT_M1","INC_PCT_M2","INC_PCT_H1",
"INC_PCT_H2")])),
totprobdep = rowSums(as.matrix(.[c("DEP_INC_PCT_LO","DEP_INC_PCT_M1","DEP_INC_PCT_M2",
"DEP_INC_PCT_H1","DEP_INC_PCT_H2")])),
totprobind = rowSums(as.matrix(.[c("IND_INC_PCT_LO","IND_INC_PCT_M1","IND_INC_PCT_M2",
"IND_INC_PCT_H1","IND_INC_PCT_H2")])),
totprob2dep = rowSums(as.matrix(.[c("DEP_INC_PCT_LO","DEP_INC_PCT_H2")])),
totprob2ind = rowSums(as.matrix(.[c("IND_INC_PCT_LO","IND_INC_PCT_H2")])))
income %>% print(n=20)
## # A tibble: 1,438 x 25
## unitID College UGDS DEP_STAT_PCT_IND
## <dbl> <fctr> <dbl> <dbl>
## 1 100654 Alabama A & M University 4051 0.12474184
## 2 100663 University of Alabama at Birmingham 11200 0.29927474
## 3 100706 University of Alabama in Huntsville 5525 0.37228542
## 4 100724 Alabama State University 5354 0.15468227
## 5 100751 The University of Alabama 28692 0.16031165
## 6 100812 Athens State University 2999 0.73529412
## 7 100830 Auburn University at Montgomery 4322 0.29577465
## 8 100858 Auburn University 19761 0.08752683
## 9 100937 Birmingham Southern College 1181 0.02843602
## 10 101073 Concordia College Alabama 523 0.16939891
## 11 101189 Faulkner University 2358 0.58565955
## 12 101435 Huntingdon College 1100 0.25824176
## 13 101480 Jacksonville State University 7195 0.27142857
## 14 101541 Judson College 331 0.27363184
## 15 101587 University of West Alabama 1988 0.25444265
## 16 101675 Miles College 1663 0.15498652
## 17 101693 University of Mobile 1472 0.36000000
## 18 101709 University of Montevallo 2610 0.18958155
## 19 101879 University of North Alabama 5536 0.20905350
## 20 101912 Oakwood University 1854 0.21806167
## # ... with 1,418 more rows, and 21 more variables: INC_PCT_LO <dbl>,
## # DEP_INC_PCT_LO <dbl>, IND_INC_PCT_LO <dbl>, DEP_INC_PCT_M1 <dbl>,
## # DEP_INC_PCT_M2 <dbl>, DEP_INC_PCT_H1 <dbl>, IND_INC_PCT_M1 <dbl>,
## # IND_INC_PCT_M2 <dbl>, IND_INC_PCT_H1 <dbl>, IND_INC_PCT_H2 <dbl>,
## # DEP_INC_PCT_H2 <dbl>, INC_PCT_M1 <dbl>, INC_PCT_M2 <dbl>,
## # INC_PCT_H1 <dbl>, INC_PCT_H2 <dbl>, probSchool <dbl>, totprob <dbl>,
## # totprobdep <dbl>, totprobind <dbl>, totprob2dep <dbl>,
## # totprob2ind <dbl>
# To avoid dropping many schools due to missing values will hold 7 variables
# for income: DEP_STAT_PCT_IND, DEP_INC_PCT_LO, DEP_INC_PCT_H2 &
# DEP_INC_GTLO_LTH2, IND_INC_PCT_LO, IND_INC_PCT_H2 & IND_INC_GTLO_LTH2; where
# the x_INC_GTLO_LTH2 variables are 1 - x_INC_PCT_H2 - x_INC_PCT_LO; and the
# sum of the latter 6 columns = 1. So must mutate the current IND_... columns
# by multiplying by DEP_STAT_PCT_IND. An "x" is appended to the end of the
# newly computed variable names.
income %<>% mutate(probSchool = UGDS/sum(UGDS),
DEP_STAT_PCT_INDx = DEP_STAT_PCT_IND,
DEP_STAT_PCT_DEPx = 1 - DEP_STAT_PCT_INDx,
DEP_INC_PCT_LOx = DEP_INC_PCT_LO*DEP_STAT_PCT_DEPx,
DEP_INC_PCT_HI2x = DEP_INC_PCT_H2*DEP_STAT_PCT_DEPx,
DEP_INC_GTLO_LTH2x = DEP_STAT_PCT_DEPx - DEP_INC_PCT_LOx - DEP_INC_PCT_HI2x,
IND_INC_PCT_LOx = IND_INC_PCT_LO*DEP_STAT_PCT_INDx,
IND_INC_PCT_HI2x = IND_INC_PCT_H2*DEP_STAT_PCT_INDx,
IND_INC_GTLO_LTH2x = DEP_STAT_PCT_IND - IND_INC_PCT_LOx - IND_INC_PCT_HI2x,
totprobx = DEP_INC_PCT_LOx + DEP_INC_PCT_HI2x + DEP_INC_GTLO_LTH2x + IND_INC_PCT_LOx + IND_INC_PCT_HI2x + IND_INC_GTLO_LTH2x) %>%
dplyr::select(unitID,College,UGDS,probSchool,matches("x$"))
income %>% print(n=20)
## # A tibble: 1,438 x 13
## unitID College UGDS probSchool
## <dbl> <fctr> <dbl> <dbl>
## 1 100654 Alabama A & M University 4051 5.165870e-04
## 2 100663 University of Alabama at Birmingham 11200 1.428234e-03
## 3 100706 University of Alabama in Huntsville 5525 7.045528e-04
## 4 100724 Alabama State University 5354 6.827467e-04
## 5 100751 The University of Alabama 28692 3.658829e-03
## 6 100812 Athens State University 2999 3.824351e-04
## 7 100830 Auburn University at Montgomery 4322 5.511452e-04
## 8 100858 Auburn University 19761 2.519940e-03
## 9 100937 Birmingham Southern College 1181 1.506021e-04
## 10 101073 Concordia College Alabama 523 6.669341e-05
## 11 101189 Faulkner University 2358 3.006942e-04
## 12 101435 Huntingdon College 1100 1.402730e-04
## 13 101480 Jacksonville State University 7195 9.175126e-04
## 14 101541 Judson College 331 4.220941e-05
## 15 101587 University of West Alabama 1988 2.535115e-04
## 16 101675 Miles College 1663 2.120672e-04
## 17 101693 University of Mobile 1472 1.877107e-04
## 18 101709 University of Montevallo 2610 3.328295e-04
## 19 101879 University of North Alabama 5536 7.059555e-04
## 20 101912 Oakwood University 1854 2.364237e-04
## # ... with 1,418 more rows, and 9 more variables: DEP_STAT_PCT_INDx <dbl>,
## # DEP_STAT_PCT_DEPx <dbl>, DEP_INC_PCT_LOx <dbl>,
## # DEP_INC_PCT_HI2x <dbl>, DEP_INC_GTLO_LTH2x <dbl>,
## # IND_INC_PCT_LOx <dbl>, IND_INC_PCT_HI2x <dbl>,
## # IND_INC_GTLO_LTH2x <dbl>, totprobx <dbl>
#income[income$College=='California Institute of Technology',
# c('IND_INC_PCT_LOx','IND_INC_PCT_HI2x','IND_INC_GTLO_LTH2x','totprobx')] <- c(0,0,0,1)
makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*income$probSchool,na.rm=TRUE))} # Approximate
incomeBF <- income %>%
dplyr::select(matches("x$"),-totprobx) %>%
mutate_each(funs(makeBF)) %>%
setNames(gsub("(.+)x$","BF_\\1x",names(.))) %$%
bind_cols(income[1:4],.)
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
incomeBFdiscrete <- incomeBF %>% dplyr::select(5:ncol(incomeBF)) %>% filter(complete.cases(.)) %>%
lapply(function(x) discretize(data.frame(x[ifelse(is.na(x),FALSE,x>-5)]),method='quantile',breaks=5))
incomeBFlowestState <- incomeBF %>% dplyr::select(5:ncol(incomeBF)) %>% filter(complete.cases(.)) %>%
lapply(function(x) ifelse(x<= -5,"(-Inf,-5]",'FALSE'))
incomeBFd <- mapply(function(x,y) {
tmp <- x
tmp[tmp=='FALSE'] <- as.character(y[[1]])
return(factor(tmp,levels=c("(-Inf,-5]",levels(y[[1]]))))},
incomeBFlowestState,incomeBFdiscrete,SIMPLIFY = FALSE)
names(incomeBFd) <- names(incomeBFdiscrete)
incomeBFd %<>% data.frame %>% tbl_df
rm(list=c('incomeBFdiscrete','incomeBFlowestState'))
incomeBFd <- incomeBF %>%
filter(complete.cases(incomeBF %>% dplyr::select(5:ncol(incomeBF)))) %>%
dplyr::select(1:4) %>%
bind_cols(incomeBFd)
Here, I did some cluster analysis using package poLCA but ultimately set it aside to be dug into later…
# #-------------------------------------
# bmod <- poLCA(as.formula(paste0('cbind(',paste(names(incomeBFd)[-(1:4)],collapse=','),') ~ 1')),
# data=incomeBFd,nclass = 9,nrep=13,na.rm=FALSE)
# new.probs.start <- poLCA.reorder(bmod$probs.start,order(bmod$P,decreasing=TRUE))
# bmod <- poLCA(as.formula(paste0('cbind(',paste(names(incomeBFd)[-(1:4)],collapse=','),') ~ 1')),
# data=incomeBFd,nclass = 9,probs.start=new.probs.start,na.rm=FALSE)
#
# # Compute the mutual information shared by each variable with the class labels:
# miBFclass <- incomeBFd %>%
# mutate(class=bmod$predclass) %$%
# sapply(grep('^BF',names(.),value=TRUE),mInfo,ynm='class',data=.) %>%
# sort(decreasing=TRUE) %T>%
# print
# incomeBFd %>%
# mutate(class=bmod$predclass) %>%
# inner_join(incomeBF,by='unitID') %>%
# group_by(class) %>%
# summarize_each(funs(median),matches('x.y$')) %>%
# dplyr::select(one_of(names(miBFclass))) %>%
# print(width=Inf)
#
# incomeBFd %>%
# dplyr::select(-c(1,3:4)) %>%
# mutate(class=bmod$predclass) %>%
# filter(class==9) %>%
# print(n=20,width=Inf)
#
# incomeBFd %>%
# mutate(class=bmod$predclass) %>%
# filter(grepl("Cali.+Insti.+Tech|Harvard|Mass.+Inst.+Tech|Northwestern Univ|Princeton|Cornell U",
# College)) %>%
# dplyr::select(one_of(setdiff(names(.),names(miBFclass))),one_of(names(miBFclass))) %>%
# arrange(class) %>%
# print(width=Inf)
# #---------------------------------------
#===================================================
aid <- student %>%
dplyr::select(unitID,College,UGDS,matches("pell_ever|fsend")) %>%
mutate(probSchool = UGDS/sum(UGDS),
totprob = rowSums(as.matrix(.[grepl('fsend',names(.))])))
aid %>% print(n=20)
## # A tibble: 1,438 x 11
## unitID College UGDS pell_ever_2005
## <dbl> <fctr> <dbl> <dbl>
## 1 100654 Alabama A & M University 4051 0.81
## 2 100663 University of Alabama at Birmingham 11200 0.59
## 3 100706 University of Alabama in Huntsville 5525 0.60
## 4 100724 Alabama State University 5354 0.85
## 5 100751 The University of Alabama 28692 0.53
## 6 100812 Athens State University 2999 0.68
## 7 100830 Auburn University at Montgomery 4322 0.67
## 8 100858 Auburn University 19761 0.46
## 9 100937 Birmingham Southern College 1181 0.35
## 10 101073 Concordia College Alabama 523 1.00
## 11 101189 Faulkner University 2358 0.70
## 12 101435 Huntingdon College 1100 0.53
## 13 101480 Jacksonville State University 7195 0.70
## 14 101541 Judson College 331 0.62
## 15 101587 University of West Alabama 1988 0.76
## 16 101675 Miles College 1663 0.89
## 17 101693 University of Mobile 1472 0.62
## 18 101709 University of Montevallo 2610 0.61
## 19 101879 University of North Alabama 5536 0.65
## 20 101912 Oakwood University 1854 0.69
## # ... with 1,418 more rows, and 7 more variables: fsend_1_2005 <dbl>,
## # fsend_2_2005 <dbl>, fsend_3_2005 <dbl>, fsend_4_2005 <dbl>,
## # fsend_5_2005 <dbl>, probSchool <dbl>, totprob <dbl>
makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*aid$probSchool,na.rm=TRUE))} # Approximate
aidBF <- aid %>%
dplyr::select(matches("pell_ever|fsend")) %>%
mutate_each(funs(makeBF)) %>%
setNames(paste0("BF_",names(.))) %$%
bind_cols(aid[1:3],.)
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
g <- aidBF %>% gather(key=Variable,value=BF,-(1:3)) %>%
ggplot(aes(x=BF,fill=Variable))
g + geom_density(alpha=0.3) +
facet_wrap(~Variable) +
scale_x_continuous(lim=c(-1,1)) +
theme(text=element_text(size=14,face = 'bold'))
## Warning: Removed 2 rows containing non-finite values (stat_density).
#===================================================
earnRepay <- student %>%
dplyr::select(unitID,College,UGDS,matches("^(RPY|CDR|.+earn_wne_p6)")) %T>%
print(n=20) %>%
mutate(probSchool = UGDS/sum(UGDS),
sdlog = sqrt(log((sd_earn_wne_p6_2005/mn_earn_wne_p6_2005)^2+1)),
meanlog = log(mn_earn_wne_p6_2005^2/sqrt(mn_earn_wne_p6_2005^2+sd_earn_wne_p6_2005^2)),
p_le30K = plnorm(30.0E3,meanlog,sdlog),
p_gt30Kle48K = plnorm(48.0E3,meanlog,sdlog) - p_le30K,
p_gt48Kle75K = plnorm(75.0E3,meanlog,sdlog) - plnorm(48.0E3,meanlog,sdlog),
p_gt75Kle110K = plnorm(110.0E3,meanlog,sdlog) - plnorm(75.0E3,meanlog,sdlog),
p_gt110K = plnorm(110.0E3,meanlog,sdlog,lower.tail=FALSE),
totprob = p_le30K + p_gt30Kle48K + p_gt48Kle75K + p_gt75Kle110K + p_gt110K) %>%
dplyr::select(1:7,probSchool,starts_with('p_')) %T>%
print(n=20)
## # A tibble: 1,438 x 14
## unitID College UGDS CDR3 RPY_3YR_RT
## <dbl> <fctr> <dbl> <dbl> <dbl>
## 1 100654 Alabama A & M University 4051 0.163 0.4447139
## 2 100663 University of Alabama at Birmingham 11200 0.080 0.7562667
## 3 100706 University of Alabama in Huntsville 5525 0.077 0.7819979
## 4 100724 Alabama State University 5354 0.191 0.3311989
## 5 100751 The University of Alabama 28692 0.081 0.8139413
## 6 100812 Athens State University 2999 0.094 0.7676491
## 7 100830 Auburn University at Montgomery 4322 0.131 0.6288562
## 8 100858 Auburn University 19761 0.061 0.8824554
## 9 100937 Birmingham Southern College 1181 0.099 0.8498498
## 10 101073 Concordia College Alabama 523 0.279 0.2541254
## 11 101189 Faulkner University 2358 0.120 0.5630294
## 12 101435 Huntingdon College 1100 0.105 0.7814313
## 13 101480 Jacksonville State University 7195 0.113 0.6276079
## 14 101541 Judson College 331 0.117 0.7181208
## 15 101587 University of West Alabama 1988 0.096 0.5432300
## 16 101675 Miles College 1663 0.221 0.2167889
## 17 101693 University of Mobile 1472 0.035 0.7558570
## 18 101709 University of Montevallo 2610 0.098 0.7711864
## 19 101879 University of North Alabama 5536 0.134 0.6879010
## 20 101912 Oakwood University 1854 0.107 0.4392157
## # ... with 1,418 more rows, and 9 more variables: RPY_5YR_RT <dbl>,
## # RPY_7YR_RT <dbl>, mn_earn_wne_p6_2005 <dbl>,
## # md_earn_wne_p6_2005 <dbl>, pct10_earn_wne_p6_2005 <dbl>,
## # pct25_earn_wne_p6_2005 <dbl>, pct75_earn_wne_p6_2005 <dbl>,
## # pct90_earn_wne_p6_2005 <dbl>, sd_earn_wne_p6_2005 <dbl>
## # A tibble: 1,438 x 13
## unitID College UGDS CDR3 RPY_3YR_RT
## <dbl> <fctr> <dbl> <dbl> <dbl>
## 1 100654 Alabama A & M University 4051 0.163 0.4447139
## 2 100663 University of Alabama at Birmingham 11200 0.080 0.7562667
## 3 100706 University of Alabama in Huntsville 5525 0.077 0.7819979
## 4 100724 Alabama State University 5354 0.191 0.3311989
## 5 100751 The University of Alabama 28692 0.081 0.8139413
## 6 100812 Athens State University 2999 0.094 0.7676491
## 7 100830 Auburn University at Montgomery 4322 0.131 0.6288562
## 8 100858 Auburn University 19761 0.061 0.8824554
## 9 100937 Birmingham Southern College 1181 0.099 0.8498498
## 10 101073 Concordia College Alabama 523 0.279 0.2541254
## 11 101189 Faulkner University 2358 0.120 0.5630294
## 12 101435 Huntingdon College 1100 0.105 0.7814313
## 13 101480 Jacksonville State University 7195 0.113 0.6276079
## 14 101541 Judson College 331 0.117 0.7181208
## 15 101587 University of West Alabama 1988 0.096 0.5432300
## 16 101675 Miles College 1663 0.221 0.2167889
## 17 101693 University of Mobile 1472 0.035 0.7558570
## 18 101709 University of Montevallo 2610 0.098 0.7711864
## 19 101879 University of North Alabama 5536 0.134 0.6879010
## 20 101912 Oakwood University 1854 0.107 0.4392157
## # ... with 1,418 more rows, and 8 more variables: RPY_5YR_RT <dbl>,
## # RPY_7YR_RT <dbl>, probSchool <dbl>, p_le30K <dbl>, p_gt30Kle48K <dbl>,
## # p_gt48Kle75K <dbl>, p_gt75Kle110K <dbl>, p_gt110K <dbl>
makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*earnRepay$probSchool,na.rm=TRUE))} # Approximate
earnRepayBF <- earnRepay %>%
dplyr::select(-c(1:3,5:8)) %>%
mutate_each(funs(makeBF)) %>%
setNames(paste0("BF_",names(.))) %$%
bind_cols(earnRepay[c(1:3,8)],.) %>%
filter(complete.cases(.))
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
earnRepayBF %>%
dplyr::select(-3) %>%
inner_join(student %>% dplyr::select(unitID,ADM_RATE),by='unitID') %>%
filter(!is.na(ADM_RATE)) %>%
mutate(ADM_RATEd=discretize(.['ADM_RATE'],method='interval',breaks=6)[['ADM_RATE']]) %>%
arrange(desc(BF_p_gt110K)) %>%
filter(BF_p_le30K<0.5 & BF_p_gt110K>0.5) %$%
qplot(x=BF_p_le30K,y=BF_p_gt110K,color=ADM_RATEd) + geom_text(aes(label=College))
#===================================================
student %$% { locale %>% summary %>% print }
## City: Large (population of 250,000 or more)
## 308
## City: Midsize (population of at least 100,000 but less than 250,000)
## 184
## City: Small (population less than 100,000)
## 214
## Rural: Distant (rural territory more than 5 miles but up to 25 miles from an urbanized area or more than 2.5 and up to 10 miles from an urban cluster)
## 24
## Rural: Fringe (rural territory up to 5 miles from an urbanized area or up to 2.5 miles from an urban cluster)
## 45
## Rural: Remote (rural territory more than 25 miles from an urbanized area and more than 10 miles from an urban cluster)
## 12
## Suburb: Large (outside principal city, in urbanized area with population of 250,000 or more)
## 255
## Suburb: Midsize (outside principal city, in urbanized area with population of at least 100,000 but less than 250,000)
## 43
## Suburb: Small (outside principal city, in urbanized area with population less than 100,000)
## 30
## Town: Distant (in urban cluster more than 10 miles and up to 35 miles from an urbanized area)
## 150
## Town: Fringe (in urban cluster up to 10 miles from an urbanized area)
## 59
## Town: Remote (in urban cluster more than 35 miles from an urbanized area)
## 114
student %$% { region %>% summary %>% print }
## Far West (AK, CA, HI, NV, OR, WA)
## 143
## Great Lakes (IL, IN, MI, OH, WI)
## 228
## Mid East (DE, DC, MD, NJ, NY, PA)
## 265
## New England (CT, ME, MA, NH, RI, VT)
## 137
## Plains (IA, KS, MN, MO, NE, ND, SD)
## 160
## Rocky Mountains (CO, ID, MT, UT, WY)
## 41
## Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)
## 354
## Southwest (AZ, NM, OK, TX)
## 110
localeNames <- sort(unique(as.character(student$locale)))
localeAggregates <- setNames(c(gsub('([^ ]+) ([^ ]+).+','\\1\\2',localeNames[1:3]),
rep('Rural',3),
gsub('([^ ]+) ([^ ]+).+','\\1\\2',localeNames[7]),
rep('Suburb:Small/Midsize & Town:Fringe',2),
gsub('([^ ]+) ([^ ]+).+','\\1\\2',localeNames[10]),
'Suburb:Small/Midsize & Town:Fringe',
gsub('([^ ]+) ([^ ]+).+','\\1\\2',localeNames[12])),localeNames)
student %<>% mutate(localeAgg = factor(localeAggregates[as.character(locale)],
levels=c('Rural','Town:Remote','Town:Distant','Suburb:Small/Midsize & Town:Fringe',
'Suburb:Large','City:Small','City:Midsize','City:Large')))
makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*settingBF$probSchool,na.rm=TRUE))} # Approximate
settingBF <- student %>%
dplyr::select(unitID,College,UGDS,localeAgg,region) %>%
mutate(probSchool = UGDS/sum(UGDS))
a <- settingBF %$% model.matrix(~localeAgg - 1,data=.)
colnames(a) <- gsub('^([^:]+):*(.+)$','BF_\\1\\2',colnames(a))
a %<>% as.data.frame %>% tbl_df %>% mutate_each(funs(makeBF))
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
settingBF %<>% dplyr::select(-localeAgg) %>% bind_cols(a)
a <- settingBF %>%
mutate(regionName = factor(gsub(' ','',as.character(student$region)))) %$%
model.matrix(~regionName - 1,data=.)
colnames(a) <- gsub('^regionName','BF_',colnames(a))
a %<>% as.data.frame %>% tbl_df %>% mutate_each(funs(makeBF))
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
settingBF %<>% dplyr::select(-region) %>% bind_cols(a)
Didn’t get around to adding in costs…
#===================================================
#costBF
#===================================================
Took a look at scaled test scores, but ultimately settled upon the log-normal fits to SATs below.
# scaleSAT <- function(Score) (Score - 280)/(800 - 280)
# scaleACT <- function(Score) (Score - 9)/( 36 - 9)
# Impute missing values with the means
#lmtest <- lm(ACTCMMID ~ SAT_AVG,data=student)
# academicsBF <- student %>%
# filter(!(isSpecialty=='TRUE' & is.na(ADM_RATE))) %>%
# dplyr::select(unitID,College,ADM_RATE,matches('(^C150)|(SAT|ACT)')) %>%
# filter(!is.na(C150_4_POOLED_SUPP)) %>%
# mutate(ADM_RATE = ifelse(is.na(ADM_RATE),mean(ADM_RATE,na.rm=TRUE),ADM_RATE)) %>%
# mutate(SAT_AVG_scaled = scaleSAT(ifelse(is.na(SAT_AVG) , mean(SAT_AVG,na.rm=TRUE), SAT_AVG))) %>%
# mutate(ACTCMMID_scaled = scaleACT(ifelse(is.na(ACTCMMID),mean(ACTCMMID,na.rm=TRUE),ACTCMMID))) %>%
# dplyr::select(1:3,ncol(.) - (2:0))
# a <- academicsBF %>%
# mutate_each(funs(scaleSAT),starts_with('SAT'),-ends_with('scaled')) %>%
# mutate_each(funs(scaleACT),starts_with('ACT'),-ends_with('scaled')) %>%
# gather(key=Test,value=scaledScore,-unitID,-College,-ADM_RATE)
NOTE: This is a bit of a kludge, but it results in a distribution for SAT Verbal+Math scores for each school, from which Bayes factors for 200-pt intervals are computed. (Kludginess: Adding the quartiles of Verbal & Math DOES NOT equal the quartiles of Verbal+Math.)
student %>%
mutate(SAT_25=SATVR25+SATMT25,SAT_75=SATVR75+SATMT75) %>%
filter(!is.na(SAT_AVG) & !is.na(SAT_75)) %>%
ggplot(aes(x=SAT_AVG)) + geom_density(fill='orange',alpha=0.3) +
geom_density(aes(x=SAT_25),fill='red',alpha=0.1) +
geom_density(aes(x=SAT_75),fill='green',alpha=0.1)
academics <- student %>%
mutate(SAT_25=SATVR25+SATMT25,SAT_75=SATVR75+SATMT75) %>%
filter(!is.na(SAT_AVG) & !is.na(SAT_75)) %>%
dplyr::select(unitID,College,UGDS,ADM_RATE,C150_4_POOLED_SUPP,SAT_25,SAT_AVG,SAT_75)
# Clumsily fit log-normal distributions to the SAT score quartiles for each school.
fr <- function(x,SAT_AVG,SAT_25,SAT_75,probs=c(0.25,0.75)) {
sdl <- x[1]
q <- qlnorm(p=probs,meanlog=log(SAT_AVG) - sdl^2/2,sdlog=sdl)
log(SAT_25/q[1])^2 + log(SAT_75/q[2])^2
}
getMuSd <- function(SAT_AVG,SAT_25,SAT_75) {
probs <- c(0.25,0.75)
SAT_AVG0 <- SAT_AVG
if(SAT_75==1600) {
probs <- probs/(0.75+1.0E-5)
SAT_AVG <- (SAT_AVG0 - 1600*(1-(0.75+1.0E-5)))/(0.75+1.0E-5)
}
soln <- optim(c(0.05), fr,lower=1.0E-3,upper=0.2,method='L-BFGS-B',
SAT_AVG=SAT_AVG,SAT_25=SAT_25,SAT_75=SAT_75,probs=probs)
sdl <- soln$par
meanlog <- log(SAT_AVG) - sdl^2/2
pSAT_AVG <- exp(meanlog+0.5*sdl^2)
q <- qlnorm(p=probs,meanlog=meanlog,sdlog=sdl)
if(SAT_75==1600){
pSAT_AVG <- pSAT_AVG*(0.75+1.0E-5) + 1600*(1-(0.75+1.0E-5))
}
return(c(meanlog=meanlog,sdlog=sdl,pSAT_25=q[1],pSAT_AVG=pSAT_AVG,pSAT_75=q[2]))
}
Get the log-normal parameters of the SAT distribution for each school
academics <- academics %$% {bind_cols(.,as.data.frame(t(mapply(getMuSd,SAT_AVG,SAT_25,SAT_75))))}
academics %>%
filter(grepl('Harvard|Cal.+Inst.+Tech|Mass.+Inst.+Tech|Princeton U|Northwestern U|Cornell U',College)) %>%
arrange(desc(SAT_AVG))
## # A tibble: 6 x 13
## unitID College UGDS ADM_RATE
## <dbl> <fctr> <dbl> <dbl>
## 1 110404 California Institute of Technology 977 0.1055
## 2 166683 Massachusetts Institute of Technology 4510 0.0815
## 3 166027 Harvard University 7278 0.0584
## 4 186131 Princeton University 5234 0.0741
## 5 147767 Northwestern University 8855 0.1532
## 6 190415 Cornell University 14309 0.1556
## # ... with 9 more variables: C150_4_POOLED_SUPP <dbl>, SAT_25 <dbl>,
## # SAT_AVG <dbl>, SAT_75 <dbl>, meanlog <dbl>, sdlog <dbl>,
## # pSAT_25 <dbl>, pSAT_AVG <dbl>, pSAT_75 <dbl>
Discretize the SAT distributions
academics %<>%
filter(!is.na(C150_4_POOLED_SUPP)) %>%
mutate(probSchool = UGDS/sum(UGDS),
p_le800 = plnorm( 800,meanlog,sdlog),
p_gt800le1000 = plnorm(1000,meanlog,sdlog) - p_le800,
p_gt1000le1200 = plnorm(1200,meanlog,sdlog) - plnorm(1000,meanlog,sdlog),
p_gt1200le1400 = plnorm(1400,meanlog,sdlog) - plnorm(1200,meanlog,sdlog),
p_gt1400 = plnorm(1400,meanlog,sdlog,lower.tail=FALSE),
totprob = p_le800 + p_gt800le1000 + p_gt1000le1200 + p_gt1200le1400 + p_gt1400) %>%
print
## # A tibble: 1,119 x 20
## unitID College UGDS ADM_RATE
## <dbl> <fctr> <dbl> <dbl>
## 1 100654 Alabama A & M University 4051 0.8989
## 2 100663 University of Alabama at Birmingham 11200 0.8673
## 3 100706 University of Alabama in Huntsville 5525 0.8062
## 4 100724 Alabama State University 5354 0.5125
## 5 100751 The University of Alabama 28692 0.5655
## 6 100858 Auburn University 19761 0.8274
## 7 100937 Birmingham Southern College 1181 0.6422
## 8 101435 Huntingdon College 1100 0.6279
## 9 101480 Jacksonville State University 7195 0.8326
## 10 101541 Judson College 331 0.7388
## # ... with 1,109 more rows, and 16 more variables:
## # C150_4_POOLED_SUPP <dbl>, SAT_25 <dbl>, SAT_AVG <dbl>, SAT_75 <dbl>,
## # meanlog <dbl>, sdlog <dbl>, pSAT_25 <dbl>, pSAT_AVG <dbl>,
## # pSAT_75 <dbl>, probSchool <dbl>, p_le800 <dbl>, p_gt800le1000 <dbl>,
## # p_gt1000le1200 <dbl>, p_gt1200le1400 <dbl>, p_gt1400 <dbl>,
## # totprob <dbl>
Compute Bayes Factors for the discretized SAT score distributions of each school
makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*academics$probSchool,na.rm=TRUE))} # Approximate
academicsBF <- academics %>%
dplyr::select(-c(1:3,matches('SAT|log|prob'))) %>%
mutate_each(funs(makeBF)) %>%
setNames(gsub("^p","BF_SAT",names(.))) %>%
setNames(gsub("^([^B])","BF_\\1",names(.))) %$%
bind_cols(academics[1:3],.)
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
academicsBF %>%
filter(grepl('Harvard|Cal.+Inst.+Tech|Mass.+Inst.+Tech|Princeton U|Northwestern U|Cornell U',College)) %>%
arrange(desc(BF_SAT_gt1400)) %>%
print(width=Inf)
## # A tibble: 6 x 10
## unitID College UGDS BF_ADM_RATE BF_C150_4_POOLED_SUPP BF_SAT_le800 BF_SAT_gt800le1000 BF_SAT_gt1000le1200 BF_SAT_gt1200le1400 BF_SAT_gt1400
## <dbl> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 110404 California Institute of Technology 977 -0.7718180 0.1903925 -7.763154 -8.424919 -8.5624697 -7.2934024 1.0517077
## 2 166027 Harvard University 7278 -1.0286576 0.2119162 -7.763154 -8.424919 -8.5624697 -1.2155538 1.0457960
## 3 186131 Princeton University 5234 -0.9252523 0.2086179 -7.763154 -8.424919 -8.5624697 -0.8427212 1.0376274
## 4 166683 Massachusetts Institute of Technology 4510 -0.8839129 0.1924128 -7.763154 -7.870078 -2.7437355 -0.1406519 0.9753106
## 5 147767 Northwestern University 8855 -0.6098117 0.1952930 -7.763144 -5.374484 -1.6772526 0.1449584 0.8856398
## 6 190415 Cornell University 14309 -0.6030609 0.1942822 -6.384011 -2.780483 -0.8064331 0.2595414 0.7831573
Now add in the disciplines:
makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*academics$probSchool,na.rm=TRUE))} # Approximate
discplnsBF <- academics %>% dplyr::select(unitID,probSchool) %>% inner_join(disciplines2013,by='unitID') %>%
mutate_each(funs(makeBF),-seq_len(which(names(.) == discNames[1])-1)) %>%
setNames(c(setdiff(names(.),discNames),paste0('BF_',discNames)))
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over a selection of variables, use `mutate_at()`
academicsBF %<>% inner_join(discplnsBF %>% dplyr::select(unitID,starts_with('BF_')),by='unitID')
At this point, we’ve got enough pieces of the puzzle to construct a working model that demonstrates the concept of a decision analysis tool that takes as input a student’s demographics and behavioral traits and the dataset of college Bayes factors and outputs a list of the top-N colleges maximizing the utility for that specific student.
First join the separate data tables of Bayes factors into a single table.
# NOW: Build the model covariate data frame with all possible covariate
# candidates. Each student's model only uses a subset of these columns. See
# function 'utility'....
studentBF <- student %>% dplyr::select(unitID,College,UGDS,ADM_RATE) %>%
inner_join(incomeBF %>% dplyr::select(-College,-UGDS), by='unitID') %>%
inner_join(ethnicBF %>% dplyr::select(-College), by='unitID') %>%
inner_join(aidBF %>% dplyr::select(-College,-UGDS), by='unitID') %>%
inner_join(settingBF %>% dplyr::select(-College,-UGDS,-probSchool), by='unitID') %>%
inner_join(earnRepayBF %>% dplyr::select(-College,-UGDS,-probSchool), by='unitID') %>%
# inner_join(costBF %>% dplyr::select(-College,-UGDS), by='unitID') %>%
inner_join(academicsBF %>% dplyr::select(-College,-UGDS), by='unitID') %>%
mutate(BF_prior = log10(nrow(.)*probSchool)) %>%
print(n=20)
## # A tibble: 1,119 x 97
## unitID College UGDS ADM_RATE probSchool
## <dbl> <fctr> <dbl> <dbl> <dbl>
## 1 100654 Alabama A & M University 4051 0.8989 5.165870e-04
## 2 100663 University of Alabama at Birmingham 11200 0.8673 1.428234e-03
## 3 100706 University of Alabama in Huntsville 5525 0.8062 7.045528e-04
## 4 100724 Alabama State University 5354 0.5125 6.827467e-04
## 5 100751 The University of Alabama 28692 0.5655 3.658829e-03
## 6 100858 Auburn University 19761 0.8274 2.519940e-03
## 7 100937 Birmingham Southern College 1181 0.6422 1.506021e-04
## 8 101435 Huntingdon College 1100 0.6279 1.402730e-04
## 9 101480 Jacksonville State University 7195 0.8326 9.175126e-04
## 10 101541 Judson College 331 0.7388 4.220941e-05
## 11 101693 University of Mobile 1472 0.7125 1.877107e-04
## 12 101709 University of Montevallo 2610 0.8729 3.328295e-04
## 13 101879 University of North Alabama 5536 0.8092 7.059555e-04
## 14 101912 Oakwood University 1854 0.3435 2.364237e-04
## 15 102049 Samford University 2995 0.7697 3.819250e-04
## 16 102094 University of South Alabama 11048 0.8604 1.408851e-03
## 17 102234 Spring Hill College 1269 0.4627 1.618240e-04
## 18 102270 Stillman College 863 0.4300 1.100505e-04
## 19 102377 Tuskegee University 2584 0.3511 3.295139e-04
## 20 102669 Alaska Pacific University 340 0.3745 4.335709e-05
## # ... with 1,099 more rows, and 92 more variables:
## # BF_DEP_STAT_PCT_INDx <dbl>, BF_DEP_STAT_PCT_DEPx <dbl>,
## # BF_DEP_INC_PCT_LOx <dbl>, BF_DEP_INC_PCT_HI2x <dbl>,
## # BF_DEP_INC_GTLO_LTH2x <dbl>, BF_IND_INC_PCT_LOx <dbl>,
## # BF_IND_INC_PCT_HI2x <dbl>, BF_IND_INC_GTLO_LTH2x <dbl>,
## # BF_WHITE <dbl>, BF_BLACK <dbl>, BF_HISP <dbl>, BF_ASIAN <dbl>,
## # BF_AIAN <dbl>, BF_NHPI <dbl>, BF_2MOR <dbl>, BF_NRA <dbl>,
## # BF_UNKN <dbl>, BF_female_2005 <dbl>, BF_pell_ever_2005 <dbl>,
## # BF_fsend_1_2005 <dbl>, BF_fsend_2_2005 <dbl>, BF_fsend_3_2005 <dbl>,
## # BF_fsend_4_2005 <dbl>, BF_fsend_5_2005 <dbl>, BF_localeAggRural <dbl>,
## # BF_localeAggTownRemote <dbl>, BF_localeAggTownDistant <dbl>,
## # `BF_localeAggSuburbSmall/Midsize & Town:Fringe` <dbl>,
## # BF_localeAggSuburbLarge <dbl>, BF_localeAggCitySmall <dbl>,
## # BF_localeAggCityMidsize <dbl>, BF_localeAggCityLarge <dbl>,
## # `BF_FarWest(AK,CA,HI,NV,OR,WA)` <dbl>,
## # `BF_GreatLakes(IL,IN,MI,OH,WI)` <dbl>,
## # `BF_MidEast(DE,DC,MD,NJ,NY,PA)` <dbl>,
## # `BF_NewEngland(CT,ME,MA,NH,RI,VT)` <dbl>,
## # `BF_Plains(IA,KS,MN,MO,NE,ND,SD)` <dbl>,
## # `BF_RockyMountains(CO,ID,MT,UT,WY)` <dbl>,
## # `BF_Southeast(AL,AR,FL,GA,KY,LA,MS,NC,SC,TN,VA,WV)` <dbl>,
## # `BF_Southwest(AZ,NM,OK,TX)` <dbl>, BF_CDR3 <dbl>, BF_p_le30K <dbl>,
## # BF_p_gt30Kle48K <dbl>, BF_p_gt48Kle75K <dbl>, BF_p_gt75Kle110K <dbl>,
## # BF_p_gt110K <dbl>, BF_ADM_RATE <dbl>, BF_C150_4_POOLED_SUPP <dbl>,
## # BF_SAT_le800 <dbl>, BF_SAT_gt800le1000 <dbl>,
## # BF_SAT_gt1000le1200 <dbl>, BF_SAT_gt1200le1400 <dbl>,
## # BF_SAT_gt1400 <dbl>, BF_AgricultureAgriculture <dbl>,
## # BF_NaturalResources <dbl>, BF_ArchitectureRelated <dbl>,
## # BF_AreaEthnic <dbl>, BF_CommunicationJournalism <dbl>,
## # BF_CommunicationsTechnologies <dbl>, BF_ComputerInformation <dbl>,
## # BF_PersonalCulinary <dbl>, BF_Education <dbl>, BF_Engineering <dbl>,
## # BF_EngineeringTechnologies <dbl>, BF_ForeignLanguages <dbl>,
## # BF_FamilyConsumer <dbl>, BF_LegalProfessions <dbl>,
## # BF_EnglishLanguage <dbl>, BF_LiberalArts <dbl>,
## # BF_LibraryScience <dbl>, BF_BiologicalBiomedical <dbl>,
## # BF_MathematicsStatistics <dbl>, BF_MilitaryTechnologies <dbl>,
## # BF_MultiInterdisciplinary <dbl>, BF_ParksRecreation <dbl>,
## # BF_PhilosophyReligious <dbl>, BF_TheologyReligious <dbl>,
## # BF_PhysicalSciences <dbl>, BF_ScienceTechnologies <dbl>,
## # BF_Psychology <dbl>, BF_HomelandSecurity <dbl>,
## # BF_PublicAdministration <dbl>, BF_SocialSciences <dbl>,
## # BF_ConstructionTrades <dbl>, BF_MechanicRepair <dbl>,
## # BF_PrecisionProduction <dbl>, BF_TransportationMaterials <dbl>,
## # BF_VisualPerforming <dbl>, BF_HealthProfessions <dbl>,
## # BF_BusinessManagement <dbl>, BF_History <dbl>, BF_prior <dbl>
Get the coefficients for the student profile: Two levels of coefficients: 1.Property Category level, 2. Subcategory level
Trait level has 4 types with impact on properties as follows when each trait increases from low to medium to high:
Could also add in more or less uncertainty to beta’s according to traits as well…
Traits Have Impact On Four Property Categories Listed in Rows:
| Risk | Vision | Breadth | Challenge | Properties | |
|---|---|---|---|---|---|
| 1. Self & Origins | same to differ | —— | ——- | —- | ethn,inc,1stgen,usbrn,pov |
| 2. Campus Setting & Folks | homog to diverse | —— | ——- | —- | reg,loc,siz |
| 3. Academics | —- | —— | narrow to broad | low to high sat/adm | sat,admrt,disc |
| 4. After-College Prospects | —- | low to high dbt/ern | ——- | —- | comprt,earn,rpy,dfltrt |
# This maps the student profile property names to the college Bayes factor data table ('studentBF') column names.
propertyMap <- list(
ethnicity = c(white='BF_WHITE',black='BF_BLACK',hispanic='BF_HISP',asian='BF_ASIAN'),
income = list(dependent = c(lt30K="BF_DEP_INC_PCT_LOx",gt30Kle110K="BF_DEP_INC_GTLO_LTH2x",gt110K="BF_DEP_INC_PCT_HI2x"),
independent = c(lt30K="BF_IND_INC_PCT_LOx",gt30Kle110K="BF_IND_INC_GTLO_LTH2x",gt110K="BF_IND_INC_PCT_HI2x")),
sat = c(le800='BF_SAT_le800', gt800le1000='BF_SAT_gt800le1000', gt1000le1200='BF_SAT_gt1000le1200',
gt1200le1400='BF_SAT_gt1200le1400', gt1400='BF_SAT_gt1400'),
discipline = setNames(paste0('BF_',discNames),discNames),
fasfa = c(fsend_1='BF_fsend_1_2005',fsend_2='BF_fsend_2_2005',fsend_3='BF_fsend_3_2005',fsend_4='BF_fsend_4_2005',fsend_5='BF_fsend_5_2005'),
region = c(FarWest='BF_FarWest(AK,CA,HI,NV,OR,WA)',GreatLakes='BF_GreatLakes(IL,IN,MI,OH,WI)',MidEast='BF_MidEast(DE,DC,MD,NJ,NY,PA)',
NewEngland='BF_NewEngland(CT,ME,MA,NH,RI,VT)',Plains='BF_Plains(IA,KS,MN,MO,NE,ND,SD)',RockyMountains='BF_RockyMountains(CO,ID,MT,UT,WY)',
Southeast='BF_Southeast(AL,AR,FL,GA,KY,LA,MS,NC,SC,TN,VA,WV)',Southwest='BF_Southwest(AZ,NM,OK,TX)'),
locale = c(Rural='BF_localeAggRural',TownRemote='BF_localeAggTownRemote',TownDistant='BF_localeAggTownDistant',
SuburbSmallMid='BF_localeAggSuburbSmall/Midsize & Town:Fringe',SuburbLarge='BF_localeAggSuburbLarge',
CitySmall='BF_localeAggCitySmall',CityMidsize='BF_localeAggCityMidsize',CityLarge='BF_localeAggCityLarge'),
earnings = c(le30K='BF_p_le30K', gt30Kle48K='BF_p_gt30Kle48K', gt48Kle75K='BF_p_gt48Kle75K', gt75Kle110K='BF_p_gt75Kle110K', gt110K='BF_p_gt110K' ),
prior = 'BF_prior',
completion = 'BF_C150_4_POOLED_SUPP',
admission = 'BF_ADM_RATE',
gender = c(female='BF_female_2005',male='BF_female_2005')
)
A critical determinant of how well the tool works is the capturing of specific student value assessments of college properties. This is done using parameters beta in the utility function. The function below does this.
This is just to show one of the many ways in which this can be done.
getParameters <- function(studentProfile,propertyMap){
# Get the coefficients for the student profile:
propCategory <- c('SelfOrigin','Setting','Academics','Prospects')
traitLvls <- setNames(as.list(studentProfile$traits),c('Risk','Vision','Breadth','Challenge'))
propsByTrait <- list(Risk = list(SelfOrigin = c('ethnicity','income'),
Setting = c('region','locale')),
Vision = list(Prospects = c('completion','earnings'),
Setting = c('region','locale')),
Breadth = list(Academics = c('discipline')),
Challenge = list(Academics = c('sat','admission','discipline')))
# beta <- matrix(rep(1,length(xBF)),ncol=1,dimnames=list(c('ethnicity','income','sat','fasfa'),'Utility'))
beta <- matrix(rep(1,length(propertyMap)),ncol=1)
rownames(beta) <- names(propertyMap)
beta['admission',1] <- -1.0 # prefer schools that have lower admissions rates
beta['locale' ,1] <- 0.5 # downweight locale
beta['fasfa' ,1] <- 0.3 # downweight fasfa
# assumes stronger influence of gender for women than men...
beta['gender' ,1] <- ifelse(studentProfile$gender == 'female',0.5,-0.1)
# (^_^) ...there's gotta be a better way of doing this...
for(pCat in propCategory){
prpnm <- propsByTrait$Risk[[pCat]]
if (!is.null(prpnm)){
# must have vectors for each trait level with as many elements as are in the vectors of the propsByTrait list
beta[prpnm,1] <- beta[prpnm,1] * switch(traitLvls$Risk,
L=list(SelfOrigin=c(ethnicity=3.0,income=2.0),
Setting=c(region=3.0,locale=3.0))[[pCat]],
M=list(SelfOrigin=c(ethnicity=0.3,income=0.3),
Setting=c(region=1.0,locale=1.0))[[pCat]],
H=list(SelfOrigin=c(ethnicity=0.1,income=0.1),
Setting=c(region=0.1,locale=0.1))[[pCat]])
}
prpnm <- propsByTrait$Vision[[pCat]]
if (!is.null(prpnm)){
beta[prpnm,1] <- beta[prpnm,1] * switch(traitLvls$Vision,
L=list(Prospects=c(completion=0.1,earnings=0.1),
Setting = c(region=1.5,locale=1.5))[[pCat]],
M=list(Prospects=c(completion=1.0,earnings=0.5),
Setting = c(region=1.0,locale=1.0))[[pCat]],
H=list(Prospects=c(completion=3.0,earnings=3.0),
Setting = c(region=0.5,locale=0.5))[[pCat]])
}
prpnm <- propsByTrait$Breadth[[pCat]]
if (!is.null(prpnm)){
beta[prpnm,1] <- beta[prpnm,1] * switch(traitLvls$Breadth,
L=list(Academics=c(discipline=3.0))[[pCat]],
M=list(Academics=c(discipline=1.0))[[pCat]],
H=list(Academics=c(discipline=0.3))[[pCat]])
}
prpnm <- propsByTrait$Challenge[[pCat]]
if (!is.null(prpnm)){
beta[prpnm,1] <- beta[prpnm,1] * switch(traitLvls$Challenge,
L=list(Academics=c(sat=0.1,admission=0.1,discipline=0.3))[[pCat]],
M=list(Academics=c(sat=1.0,admission=1.0,discipline=1.0))[[pCat]],
H=list(Academics=c(sat=3.0,admission=3.0,discipline=3.0))[[pCat]])
}
}
# Also, make the beta's of traits sum to 1 at each level so that final utility
# is roughly same magnitude as a typical Bayes factor for school properties.
# This facilitates interpretation of utility in similar fashion as a raw Bayes factor.
beta[,1] <- beta[,1]/sum(beta[,1])
return(beta)
}
Now the utility function of the discrete choice model:
#===================================================================
# Here's the Decision Analysis Model: Probabilistic, Utility-Based Discrete Choice Model
utility <- function(studentBF,studentProfile,propertyMap,dump.covariates=FALSE){
# studentBF must come in as a 1-row matrix!!!
covarnames <- sapply(names(propertyMap),function(propnm){
if(propnm == 'income'){
return(propertyMap$income[[ifelse(studentProfile$dependent,
'dependent',
'independent')]][studentProfile[[propnm]]])
}
levelName <- studentProfile[[propnm]]
if(is.null(levelName)) return(propertyMap[[propnm]])
return(propertyMap[[propnm]][levelName])
})
if(is.null(dim(studentBF))) {
xBF <- matrix(studentBF[covarnames],nrow=1,dimnames=list(NULL,covarnames))
} else {
xBF <- matrix(studentBF[,covarnames],nrow=1,dimnames=list(NULL,covarnames))
}
# Assign useful column names...
colnames(xBF) <- paste(colnames(xBF),names(covarnames),sep=':')
#show(xBF)
# Compute the utility the student derives from the school.
beta <- setNames(studentProfile$beta,colnames(xBF))
u <- xBF %*% beta
if (dump.covariates){
wxBF <- xBF * matrix(beta,nrow=nrow(xBF),ncol=ncol(xBF),byrow = TRUE)
colnames(wxBF) <- colnames(xBF)
return(wxBF)
} else {
return(u[1])
}
}
We now specify student profiles and test out the model to judge how reasonable are its rankings of colleges per specific student criteria.
#=====================================================
# Define (3 levels)^(4 traits) = 3^4 = 81 student trait archetypes:
# Trait level has 4 categories with impact on subtraits as follows when each
# trait increases from low to medium to high:
# 1. Risk -- willingness to surround self with people & settings very different from self & origins;
# i. Weights homogeneity & sameness as self lower than other traits.
# 2. Vision -- willingness to look past short-term suitability in pursuit of future;
# i. Weights completion rates, future earnings, debt & debt repayment rates higher
# than current campus setting compatability.
# 3. Breadth -- willingness to entertain a variety of academic disciplines;
# i. Weights predominance of own major discipline lower than otherwise.
# 4. Challenge -- willingness to embrace highest academic rigor
# i. Weights admissions selectiveness, high SAT & higher than otherwise.
makeTraitLabels <- function(trait,ntrait,levels){
# Note that this must be called with positive integers for 'trait' & 'ntrait' and ntrait <= length(levels)
# (with intention to use ntrait==length(levels)) where 'levels' is a list of character vectors.
# Will return a character vector with length = prod(sapply(levels,length)).
if(trait>=ntrait) return(levels[[trait]])
return(c(sapply(levels[[trait]],paste0,makeTraitLabels(trait+1,ntrait,levels))))
}
traitLevels <- list(Risk=c('Lr','Mr','Hr'),Vision=c('Lv','Mv','Hv'),Breadth=c('Lb','Mb','Hb'),Challenge=c('Lc','Mc','Hc'))
traitLabels <- makeTraitLabels(trait=1,ntrait=length(traitLevels),levels=traitLevels)
StudentTrait <- setNames(strsplit(traitLabels,'[a-z]'),traitLabels)
#=====================================================
Modal States of Each Student Property:
a <- studentBF %>% dplyr::select(probSchool,starts_with('BF')) %>% as.matrix
meanBF <- t(a[,1] %*% a[,-1])
# *** PROBLEMATIC FOR MAINTENANCE & REVISION: MUST RESET THESE IF ANY COLUMNS CHANGE IN studentBF ***
inm <- setNames(c(2,6,9,1,1,5,8,8,1,5,1,1,5,38),
c('dependent','income','ethnicity','gender','pellgrant','fasfa',
'locale','region','defaultrate','earnings','admissions',
'completion','sat','discipline'))
propNames <- mapply(function(i,si) rownames(meanBF)[seq(si-i+1,length.out=i)],inm,cumsum(inm))
domProps <- sapply(propNames,function(nms) names(meanBF[nms,])[which.max(meanBF[nms,])])
show(domProps)
## dependent
## "BF_DEP_STAT_PCT_DEPx"
## income
## "BF_DEP_INC_GTLO_LTH2x"
## ethnicity
## "BF_WHITE"
## gender
## "BF_female_2005"
## pellgrant
## "BF_pell_ever_2005"
## fasfa
## "BF_fsend_3_2005"
## locale
## "BF_localeAggCityLarge"
## region
## "BF_Southeast(AL,AR,FL,GA,KY,LA,MS,NC,SC,TN,VA,WV)"
## defaultrate
## "BF_CDR3"
## earnings
## "BF_p_gt30Kle48K"
## admissions
## "BF_ADM_RATE"
## completion
## "BF_C150_4_POOLED_SUPP"
## sat
## "BF_SAT_gt1000le1200"
## discipline
## "BF_Psychology"
Create a few specific student profiles
(Check function ‘utility’ for valid entries for the student profile properties: valid entries appear as the names() of vectors ‘ethlist’, ‘inclist’, etc.)
students <- list(
Black = list(
dependent = TRUE,
ethnicity = 'black',
gender = 'female',
age = 'le24',
income = 'gt30Kle110K',
earnings = 'gt110K',
sat = 'gt1000le1200',
fasfa = 'fsend_4',
discipline= 'SocialSciences',
region = 'MidEast',
locale = 'CityLarge',
traits = StudentTrait$MrMvMbMc
),
White = list(
dependent = TRUE,
ethnicity = 'white',
gender = 'female',
age = 'le24',
income = 'gt30Kle110K',
earnings = 'gt110K',
sat = 'gt1000le1200',
fasfa = 'fsend_4',
discipline= 'SocialSciences',
region = 'MidEast',
locale = 'CityLarge',
traits = StudentTrait$MrMvMbMc
),
Hispanic = list(
dependent = TRUE,
ethnicity = 'hispanic',
gender = 'female',
age = 'le24',
income = 'gt30Kle110K',
earnings = 'gt110K',
sat = 'gt1000le1200',
fasfa = 'fsend_4',
discipline= 'SocialSciences',
region = 'MidEast',
locale = 'CityLarge',
traits = StudentTrait$MrMvMbMc
),
Asian = list(
dependent = TRUE,
ethnicity = 'asian',
gender = 'female',
age = 'le24',
income = 'gt30Kle110K',
earnings = 'gt110K',
sat = 'gt1000le1200',
fasfa = 'fsend_4',
discipline= 'SocialSciences',
region = 'MidEast',
locale = 'CityLarge',
traits = StudentTrait$MrMvMbMc
))
Let’s look at plots of the model parameters for a few specific student types:
# Test getParameters:
plotBeta <- function(beta,stdtProf){
trtstr <- paste(paste(c('Risk','Vision','Breadth','Challenge'),stdtProf$traits,sep=":"),collapse=', ')
b <- data_frame(Trait=factor(rownames(beta),levels=rownames(beta)[order(beta[,1],decreasing=TRUE)]),Value=beta[,1])
gplt <- b %>%
ggplot(aes(x=Trait,y=Value,fill=Trait))
(gplt +
geom_bar(stat='identity',position='dodge') +
#theme(text=element_text(face='bold')) +
scale_y_continuous(lim=c(-1,1)/2) +
ggtitle(sprintf("Utility Partworths for %s",trtstr))) %>% print
}
Note how the parameters change with the behavior traits of Risk, Vision, Breadth & Challenge.
stdtProf <- students$White
beta1 <- getParameters(stdtProf,propertyMap)
plotBeta(beta1,stdtProf)
stdtProf <- students$White
stdtProf$traits <- StudentTrait$HrLvLbHc
beta2 <- getParameters(stdtProf,propertyMap)
plotBeta(beta2,stdtProf)
## Warning: Removed 1 rows containing missing values (geom_bar).
stdtProf <- students$White
stdtProf$traits <- StudentTrait$LrHvHbLc
beta3 <- getParameters(stdtProf,propertyMap)
plotBeta(beta3,stdtProf)
What follows is function studentCaseStudy the ‘work horse’ function to provide the student with personalized college rankings:
# Define a function to assess the suitability of colleges for a specific student's profile:
studentCaseStudy <- function(studentBF,stdtProf,propertyMap,ntop=20,signifLevel=0.15,verbose=FALSE,exact=FALSE){
utilities <- studentBF %>%
dplyr::select(starts_with('BF')) %>%
apply(1,utility,studentProfile=stdtProf,propertyMap=propertyMap) %>%
setNames(studentBF$College)
tmp <- studentBF %>%
dplyr::select(starts_with('BF')) %>%
slice(1) %>% as.matrix %>%
utility(studentProfile=stdtProf,propertyMap=propertyMap,dump.covariates=TRUE)
wxBF <- studentBF %>%
dplyr::select(starts_with('BF')) %>%
apply(1,utility,studentProfile=stdtProf,propertyMap=propertyMap,dump.covariates=TRUE) %>%
t %>% as.data.frame %>% tbl_df %>%
setNames(gsub('^[^:]+:','',colnames(tmp))) %>%
mutate(unitID = studentBF$unitID,
College = studentBF$College,
Utility = utilities) %>%
arrange(desc(Utility)) %>%
dplyr::select(unitID,College,Utility,everything())
labels <- sprintf("%d. %s",order(order(utilities,decreasing=TRUE)),names(utilities))
stdtUtility <- data_frame(unitID = studentBF$unitID,
College = names(utilities),
Utility = utilities) %>%
mutate(labels = factor(labels,levels=labels[order(Utility,decreasing=TRUE)]),
expChoice = 10^Utility,
ChoicePct = 100*expChoice/sum(expChoice)) %>%
dplyr::select(-expChoice) %>%
inner_join(wxBF %>% dplyr::select(-College,-Utility),by='unitID') %>%
arrange(desc(Utility)) #%T>% print
topN <- stdtUtility %>% dplyr::select(-College) %>% top_n(ntop,Utility)
if(verbose) {
topN %>% print
traitStr <- paste(paste(c('Risk','Vision','Breadth','Challenge'),stdtProf$traits,sep=":"),collapse=', ')
profText <- stdtProf[!grepl('beta|traits',names(stdtProf))]
pltTitle <- paste(strwrap(paste(paste(names(profText),profText,sep=':'),collapse=', '),width = 80),collapse='\n')
pltTitle <- paste0("Top ",ntop," Colleges for Profile:\n",pltTitle,'\n',traitStr)
gplt0 <- topN %>%
mutate(labels = factor(labels,levels=rev(levels(labels))),
labelY = pmin(0,min(Utility))) %>%
ggplot(aes(x=labels,y=Utility,fill=labels))
gplt <- gplt0 +
geom_bar(stat='identity',position='dodge') + coord_flip() +
geom_text(aes(x=labels,y=labelY,label=labels),size=6,hjust=0,vjust=0.5) +
theme(text=element_text(face='bold',size=8),
axis.text.y=element_blank(),
title=element_text(size=12,face='bold'),
legend.position = "none") +
ggtitle(pltTitle) +
labs(x=sprintf("Top %d Colleges",ntop))
gplt %>% print
}
# Back-calculate the beta coefficients and see if can determine which
# covariate contributed most strongly to the utility rankings.
stdtBF <- stdtUtility %>%
inner_join(studentBF %>% dplyr::select(-College),by='unitID')
axbf <- stdtBF %>% dplyr::select(one_of(gsub('^[^:]+:','',colnames(tmp)))) %>% as.matrix
betacoeff <- solve(t(axbf) %*% axbf,t(axbf) %*% matrix(stdtUtility$Utility,ncol=1))
if(verbose) show(betacoeff[betacoeff>1.0E-9,])
stdtBF %<>%
dplyr::select(c(4,3,5),one_of(names(betacoeff[abs(betacoeff)>1.0E-9,])))
if(verbose) stdtBF %>% print(n=ntop,width=Inf)
utlBFcor <- stdtBF %$%
lapply(.[-(1:3)],function(y) cor.test(Utility[1:ntop],y[1:ntop],method='spearman',exact=exact)) #%T>% print
# This shows which covariates dictate the rankings by utility. If significant
# negative correlation, then that covariate is important in the latter half of
# the ranking (note that reverse ordering due to negative coefficients is
# already accounted for).
pvals <- sapply(utlBFcor,'[[','p.value')
pvals <- pvals[!is.na(pvals)]
signifCor <- utlBFcor[pvals <= pmax(signifLevel,min(pvals,na.rm=TRUE))] %$% {.[order(sapply(.,'[[','p.value'))]}
if(verbose) signifCor %>% show
results <- list(student=stdtProf,topN=topN,utilities=stdtUtility,BF=stdtBF,cor=utlBFcor,signifCor=signifCor,beta=betacoeff)
invisible(results)
}
Ideally, we’d build a nice GUI/app around this model so that a student can enter his or her specific information and get a custom ranking of colleges. Here, we’ll just manually specify a student and assess the results:
# Assess a specific student's case:
stdtProf <- students$Black
#stdtProf$gender <- 'male'
stdtProf$beta <- getParameters(stdtProf,propertyMap)
caseResult <- studentCaseStudy(studentBF,stdtProf,propertyMap,verbose=TRUE,ntop=20)
## # A tibble: 20 x 16
## unitID Utility labels
## <dbl> <dbl> <fctr>
## 1 131159 0.3505585 1. American University
## 2 131469 0.3465406 2. George Washington University
## 3 193900 0.3294539 3. New York University
## 4 131496 0.3248710 4. Georgetown University
## 5 190664 0.3199828 5. CUNY Queens College
## 6 190594 0.3049034 6. CUNY Hunter College
## 7 131520 0.2947132 7. Howard University
## 8 215293 0.2683836 8. University of Pittsburgh-Pittsburgh Campus
## 9 216339 0.2673033 9. Temple University
## 10 191241 0.2550179 10. Fordham University
## 11 190637 0.2513853 11. CUNY Lehman College
## 12 215062 0.2511576 12. University of Pennsylvania
## 13 189097 0.2469422 13. Barnard College
## 14 190512 0.2399130 14. CUNY Bernard M Baruch College
## 15 190567 0.2286066 15. CUNY City College
## 16 190549 0.2281797 16. CUNY Brooklyn College
## 17 195809 0.2274490 17. St John's University-New York
## 18 190150 0.2245557 18. Columbia University in the City of New York
## 19 190600 0.2183074 19. CUNY John Jay College of Criminal Justice
## 20 186399 0.2127885 20. Rutgers University-Newark
## # ... with 13 more variables: ChoicePct <dbl>, ethnicity.black <dbl>,
## # income.gt30Kle110K <dbl>, sat.gt1000le1200 <dbl>,
## # discipline.SocialSciences <dbl>, fasfa.fsend_4 <dbl>,
## # region.MidEast <dbl>, locale.CityLarge <dbl>, earnings.gt110K <dbl>,
## # prior <dbl>, completion <dbl>, admission <dbl>, gender.female <dbl>
## ethnicity.black income.gt30Kle110K
## 1 1
## sat.gt1000le1200 discipline.SocialSciences
## 1 1
## fasfa.fsend_4 region.MidEast
## 1 1
## locale.CityLarge earnings.gt110K
## 1 1
## prior completion
## 1 1
## admission gender.female
## 1 1
## # A tibble: 1,119 x 15
## labels Utility ChoicePct ethnicity.black income.gt30Kle110K sat.gt1000le1200 discipline.SocialSciences fasfa.fsend_4 region.MidEast locale.CityLarge earnings.gt110K prior completion admission gender.female
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1. American University 0.3505585 1.740040 -0.011261582 0.010633611 -0.005972833 0.109420682 0.0119673501 0.1211888 0.04571754 0.024259747 -0.001754930 0.018397979 0.025061210 2.900901e-03
## 2 2. George Washington University 0.3465406 1.724017 -0.010852213 0.008589604 -0.027228160 0.091504238 0.0054844109 0.1211888 0.04571754 0.024593490 0.024942931 0.020047113 0.043047656 -4.948562e-04
## 3 3. New York University 0.3294539 1.657505 -0.017677039 -0.005417375 -0.057770376 0.041273125 0.0027660389 0.1211888 0.04571754 0.035236322 0.078158782 0.023282996 0.059259808 3.435231e-03
## 4 4. Georgetown University 0.3248710 1.640106 -0.011325299 0.008161051 -0.115293974 0.095944416 0.0054844109 0.1211888 0.04571754 0.056551517 0.002407464 0.030022356 0.088341977 -2.329315e-03
## 5 5. CUNY Queens College 0.3199828 1.621749 -0.008638860 -0.012695869 0.034202974 0.066853462 -0.0040837124 0.1211888 0.04571754 -0.005113733 0.048983914 -0.004744205 0.035411575 2.900901e-03
## 6 6. CUNY Hunter College 0.3049034 1.566406 -0.002115287 -0.010535503 0.020664480 0.043782752 0.0027660389 0.1211888 0.04571754 -0.013899521 0.054723124 -0.014661894 0.050315889 6.956979e-03
## 7 7. Howard University 0.2947132 1.530080 0.042963634 0.004367055 0.016306909 0.017057312 0.0119673501 0.1211888 0.04571754 0.012589169 -0.001497064 0.002587864 0.015972434 5.492180e-03
## 8 8. University of Pittsburgh-Pittsburgh Campus 0.2683836 1.440073 -0.014495363 0.009664556 -0.007969455 0.008909767 -0.0040837124 0.1211888 0.04571754 0.019818547 0.065149585 0.019644596 0.007168040 -2.329315e-03
## 9 9. Temple University 0.2673033 1.436495 0.003963264 -0.002928684 0.017138871 -0.017771861 0.0078821866 0.1211888 0.04571754 -0.005530387 0.092799642 0.006936569 -0.001597878 -4.948562e-04
## 10 10. Fordham University 0.2550179 1.396429 -0.019547693 -0.008971455 -0.005972833 0.053166857 0.0078821866 0.1211888 0.04571754 0.007430305 0.010602233 0.020533599 0.019027040 3.961276e-03
## 11 11. CUNY Lehman College 0.2513853 1.384797 0.018762441 -0.023215424 0.017636355 0.038486754 -0.0003720923 0.1211888 0.04571754 -0.024735980 0.017127792 -0.034460746 0.066888675 8.361150e-03
## 12 12. University of Pennsylvania 0.2511576 1.384071 -0.008555141 0.006231782 -0.185599333 0.045106055 -0.0040837124 0.1211888 0.04571754 0.061474646 0.028119212 0.032139192 0.110513918 -1.095389e-03
## 13 13. Barnard College 0.2469422 1.370702 -0.014909684 0.015408227 -0.053061621 0.078162061 -0.0003720923 0.1211888 0.04571754 0.003863905 -0.070326178 0.027615745 0.075418992 1.823642e-02
## 14 14. CUNY Bernard M Baruch College 0.2399130 1.348695 -0.003807438 -0.016592599 0.007542255 -0.038546347 0.0027660389 0.1211888 0.04571754 0.009227264 0.045479453 0.005518720 0.062514653 -1.095389e-03
## 15 15. CUNY City College 0.2286066 1.314037 0.010126127 -0.014222492 0.011818418 0.017979735 0.0054844109 0.1211888 0.04571754 -0.018499901 0.035473252 -0.023836359 0.040986893 -3.609819e-03
## 16 16. CUNY Brooklyn College 0.2281797 1.312745 0.013754144 -0.013128303 0.021899884 -0.037986686 -0.0003720923 0.1211888 0.04571754 -0.001532219 0.039944175 -0.009006110 0.045342506 2.358022e-03
## 17 17. St John's University-New York 0.2274490 1.310539 0.011219776 0.006049773 0.010115935 -0.033930189 0.0100270684 0.1211888 0.04571754 0.017525892 0.031495209 -0.002913229 0.010857113 9.523216e-05
## 18 18. Columbia University in the City of New York 0.2245557 1.301837 -0.008059933 -0.015046686 -0.230889587 0.070639611 0.0027660389 0.1211888 0.04571754 0.055788417 0.008729634 0.030949403 0.144479154 -1.706743e-03
## 19 19. CUNY John Jay College of Criminal Justice 0.2183074 1.283241 0.011251379 -0.011361383 -0.015543043 0.036841619 -0.0003720923 0.1211888 0.04571754 -0.014592864 0.041079485 -0.039889404 0.041086466 2.900901e-03
## 20 20. Rutgers University-Newark 0.2127885 1.267037 0.010533622 -0.003568755 0.017435291 -0.004135841 0.0027660389 0.1211888 0.04571754 0.012868068 -0.002564624 0.005096916 0.009780675 -2.329315e-03
## # ... with 1,099 more rows
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## $discipline.SocialSciences
##
## Spearman's rank correlation rho
##
## data: Utility[1:ntop] and y[1:ntop]
## S = 616, p-value = 0.01466
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5368421
##
##
## $ethnicity.black
##
## Spearman's rank correlation rho
##
## data: Utility[1:ntop] and y[1:ntop]
## S = 2000, p-value = 0.02354
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5037594
##
##
## $income.gt30Kle110K
##
## Spearman's rank correlation rho
##
## data: Utility[1:ntop] and y[1:ntop]
## S = 764, p-value = 0.06138
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4255639
##
##
## $gender.female
##
## Spearman's rank correlation rho
##
## data: Utility[1:ntop] and y[1:ntop]
## S = 989.72, p-value = 0.2763
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2558509
##
##
## $admission
##
## Spearman's rank correlation rho
##
## data: Utility[1:ntop] and y[1:ntop]
## S = 1362, p-value = 0.9198
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.02406015
I’ve copied idea and code snips from Tad Dallas’s script on Kaggle.com:
mortarBoardIcon <- makeIcon(iconUrl = "https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/Fxemoji_u1F393.svg/512px-Fxemoji_u1F393.svg.png",
iconWidth = 30, iconHeight = 30)
schools <- caseResult$topN %>% inner_join(student %>% dplyr::select(c(unitID,
College, city, state, zip, median_hh_inc_2005, lon, lat)), by = "unitID") %>%
mutate(info = paste0(College, "<br>", city, ", ", state, " ", zip, sprintf("<br> Rank: %d Utility: %.2f",
seq_along(Utility), Utility)))
map <- leaflet(schools) %>% setView(-93.65, 42.0285, zoom = 4) %>% addTiles() %>%
addMarkers(~lon, ~lat, popup = ~info, options = popupOptions(closeButton = TRUE),
clusterOptions = markerClusterOptions(), icon = mortarBoardIcon)
# print(map)
map
Finally, we’ll do a brief set of sensitivity analyses illustrating how the college rankings change with the student’s demographics and behavioral traits.
We generate barplots of the utilities of the top 20 colleges for specific student profiles. (You may want to just execute print(sensResults$grobList) at the console after running the script just to see individual top-10 school barplots in a legible format.)
#=======================================
### SENSITIVITY ANALYSIS:
# Define a function for one-variable-at-a-time analysis of the student profile properties:
#if(FALSE){
# paropts <- par(no.readonly=TRUE)
sensitivity <- function(studentBF,stdtProf0,pmap,StudentTrait,propertyMap,maxcnt=Inf,plot.it=TRUE){
#require(gridExtra)
propnms <- names(pmap)
cnt <- 0
caseList <- list()
grobList <- list()
for(propnm in propnms){
stdtProf <- stdtProf0
# par(mfrow=c(2,ceiling(length(pmap[[propnm]])/2)))
if(propnm == 'income') {
if(stdtProf$dependent) {
lvls <- names(pmap[[propnm]]$dependent)
} else {
lvls <- names(pmap[[propnm]]$independent)
}
} else {
lvls <- names(pmap[[propnm]])
}
for(levelnm in lvls){
stdtProf[[propnm]] <- levelnm
stdtProf$beta <- getParameters(stdtProf,propertyMap)
caseResult <- studentCaseStudy(studentBF,stdtProf,propertyMap,ntop = 10,verbose = FALSE)
profText <- c(stdtProf[propnm],stdtProf[!grepl(paste0('beta|traits|',propnm),names(stdtProf))])
pltTitle <- paste(strwrap(paste(paste(names(profText),profText,sep=':'),collapse=', '),width = 80),collapse='\n')
traitStr <- paste(paste(c('Risk','Vision','Breadth','Challenge'),stdtProf$traits,sep=":"),collapse=', ')
pltTitle <- paste0("Top 10 Colleges for Profile:\n",pltTitle,'\n',traitStr)
nms <- names(caseList)
caseList <- setNames(c(caseList,list(caseResult)), c(nms,pltTitle))
#if(plot.it){
gplt0 <- caseResult$topN %>%
mutate(labels = factor(labels,levels=rev(levels(labels))),
labelY = pmin(0,min(Utility))) %>%
ggplot(aes(x=labels,y=Utility,fill=labels))
gplt <- gplt0 +
geom_bar(stat='identity',position='dodge') + coord_flip() +
geom_text(aes(x=labels,y=labelY,label=labels),size=6,hjust=0,vjust=0.5) +
theme(text=element_text(face='bold',size=8),
axis.text.y=element_blank(),
title=element_text(size=12,face='bold'),
legend.position = "none") +
ggtitle(pltTitle) +
labs(x="Top 10 Colleges")
#gplt %>% print
grobList <- c(grobList,list(gplt))
#}
cnt <- cnt + 1
if(cnt>=maxcnt) break
}
if(cnt>=maxcnt) break
}
# par(paropts)
### UNFORTUNATELY, SEEMS TO LOSE CONTROL OF TEXT SIZE WHEN ggplots ARE LAID IN
### GRID USING PACKAGE 'gridExtra'.... HELP!!!
#gmulti <- marrangeGrob(grobList,nrow=2,ncol=4)
if(plot.it) do.call(grid.arrange,grobList)
return(list(caseList=caseList,grobList=grobList))
}
Perform sensitivity analysis for a stellar female student interested in Engineering and a Large City setting, with a trait profile of Risk=High, Vision=Low, Breadth=High, Challenge=High. We’ll sweep over regions.
Note that because Breadth is High, she won’t stick to strictly engineering schools. And because Vision is Low, she stresses region and locale more strongly than future earnings.
stdtProf0 <- students$White
stdtProf0$gender <- 'female'
stdtProf0$discipline <- 'Engineering'
stdtProf0$sat <- 'gt1400'
stdtProf0$traits <- StudentTrait$HrLvHbHc # Risk=High, Vision=Low, Breadth=High, Challenge=High
pmap <- propertyMap[c('region','ethnicity')]
#pmap$ethnicity <- pmap$ethnicity[1:4]
maxcnt <- length(pmap[[1]]) # Only flip ethnicity
sensResults <- sensitivity(studentBF,stdtProf0,pmap,StudentTrait,propertyMap,maxcnt = maxcnt, plot.it = FALSE)
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
print(sensResults$grobList)
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
Perform sensitivity analysis for a stellar male student interested in Engineering and a Large City setting, with a trait profile of Risk=Low, Vision=Low, Breadth=High, Challenge=High. We’ll sweep over ethnicity; then for fixed ethnicity=White, we’ll sweep over region.
Note that because Risk is Low, he’ll strongly stress region & seek student populations that look a little more like himself.
And in the case of Region = New England, the student’s top-10 even has schools with negative utility!
stdtProf0 <- students$White
stdtProf0$gender <- 'male'
stdtProf0$discipline <- 'Engineering'
stdtProf0$sat <- 'gt1400'
stdtProf0$traits <- StudentTrait$LrLvHbHc # Risk=Low, Vision=Low, Breadth=High, Challenge=High
pmap <- propertyMap[c('ethnicity','region')]
pmap$ethnicity <- pmap$ethnicity[1:4]
pmap$region <- pmap$region[c('NewEngland','Southeast','GreatLakes','FarWest')]
maxcnt <- Inf
sensResults <- sensitivity(studentBF,stdtProf0,pmap,StudentTrait,propertyMap,maxcnt = maxcnt, plot.it = FALSE)
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
## Warning in cor(rank(x), rank(y)): the standard deviation is zero
print(sensResults$grobList)
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
Now, let’s see what happens when that same stellar male student shows up with a trait profile with more risk and has a longer-term vision:
stdtProf0 <- students$White
stdtProf0$gender <- 'male'
stdtProf0$discipline <- 'Engineering'
stdtProf0$sat <- 'gt1400'
stdtProf0$traits <- StudentTrait$HrHvHbHc # Risk=High, Vision=High, Breadth=High, Challenge=High
pmap <- propertyMap[c('ethnicity','region')]
pmap$ethnicity <- pmap$ethnicity[1:4]
pmap$region <- pmap$region[c('NewEngland','Southeast','GreatLakes','FarWest')]
maxcnt <- Inf
sensResults <- sensitivity(studentBF,stdtProf0,pmap,StudentTrait,propertyMap,maxcnt = maxcnt, plot.it = FALSE)
print(sensResults$grobList)
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
#}
#------------------------------------------------------------
As you can see, the student’s value assessment, in terms of beta parameters (partworths), as determined by his or her behavioral traits and preferences strongly affects the top 10 high-utility schools for that student.
Experiment with different student profiles, such as lower SAT’s, and interest in alternative settings (locales other than Large City). The results are fascinating!
Even better, alter the beta parameters directly yourself in stdtProf0$beta….
Please, feel free to extend this model and tool. It’d be great to slap a quick Shiny interface on it so that a student can input his or her own demographics and criteria and quickly generate lists and charts detailing the student’s best college options.
Being posed in a probabilistic discrete choice modeling framework, this model represents lots of potential. It can, for example, be presented with individual student data – imagine the entire dataset of applicants to a university plus the indication of which applicants were accepted or not – and then, using tools such as the package rstan, we could estimate data-based model parameters beta suitable for identifying candidates for a specific college. This essentially reverse engineers the college’s admissions criteria.
It also offers a principled way of identifying which specific college is or is not a good choice for a student.
Even in its simplistic form shown here – which didn’t implement even some of the computed college properties – it’s quite entertaining to see how varied the rankings are given specific students – e.g., a top student (SAT>1400) with a strong regional preference makes quite interesting choices as the region of interest changes. Try it out!
Any feedback would be appreciated.
Thanks,
-Michael L. Thompson
Comments:
ALSO: To see the payoff, go straight to the Sensitivity Analysis section.
My intent was to render the R script as a PDF-format notebook using RStudio’s File–>Compile Notebook… menu command. But it was just taking too long for me to get all the formatting of plots, lists and tables right – so I bailed!
The R script version runs fine when sourced at the R console or in RStudio. In fact, that’s the only way I could get the
leafletpackage’s interactive mapping to display – until I changedprint(map)to just plainmap.Any help in cleaning up the formatting would be much appreciated.